1  Overview


In  the  past  decade,  there  has  been  a  great  deal  of  progress  in  the  area  of  nonlinear 
games  under  both  full  and  partial  observations  (i.e.  both  state  feedback  and  measurement 
feedback). These advances have been motivated in part by applications to robust/H-infinity
control and estimation, but have obvious application in the area of command and
control due to the adversarial aspects of the battlefield. The area of control of stochastic
processes is better developed, and also has obvious applications in command and
control  due  to  the  random  components  of  a  conflict.  The  planned  experiments  will  make 
use  of  these  techniques  with  the  goal  of  proving  (or  disproving)  the  advantages  that  would 
be  obtained  if  these  techniques  were  employed.  Members  of  the  NCSU  team  are  at  the 
forefront  of  these  areas,  and  so  have  unique  capabilities  to  develop  such  technologies  and 
experiments. 

Although  it  seems  obvious  that  the  modeling  of  the  enemy  activities  as  controlled  by 
an  intelligent,  antagonistic  player  would  lead  to  better  command  and  control  decisions, 
there  are  a  number  of  mitigating  factors.  In  particular,  there  are  a  number  of  simplifying 
assumptions  and  sub-optimal  techniques  which  must  be  employed  in  order  to  make  the 
problem  computationally  tractable.  The  question  is  then  whether  the  advantages  are  still 
significant  under  these  conditions.  It  is  this  team’s  belief  that  these  will  be  evident  in 
real-world  applications  as  well  as  in  simulations  that  reasonably  model  the  opponent’s 
command  decisions,  but  this  needs  to  be  proven.  (Here  we  are  using  “proven”,  not  in 
the  rigorous  mathematical  sense,  but  in  the  sense  of  being  reasonably  demonstrable  via 
multiple  simulations.) 

As  alluded  to  above,  an  obvious  difficulty  with  the  application  of  advanced  techniques 
is  the  possible  computational  burden.  One  approach  to  the  reduction  of  this  burden  is  the 
use  of  hierarchical  decompositions  of  the  problem.  Techniques  for  such  decomposition  in 
agile  manufacturing  applications  have  also  undergone  significant  development  in  the  last 
decade  (and  as  above,  members  of  the  NCSU  team  have  been  at  the  forefront  of  this). 
Although  these  techniques  have  largely  been  developed  in  the  context  of  deterministic 
and  stochastic  models,  we  have  expanded  these  to  the  (stochastic)  game  context  discussed 
above. 

An important component of higher level military operations is resource (such as aircraft)
allocation. Activities such as distribution and re-distribution of resources among
a number of geographical regions can be naturally modeled as a discrete-time Markov
decision process (MDP). A major advantage of the MDP model lies in its capability to
capture events evolving in a discrete fashion. The drawback of such a model, however, is
its inherent large dimensionality. We attack this problem using a recent development on
singular perturbations of finite-state Markov chains. In military control operations, it
is common that some variables change more rapidly than other variables. This leads to
time scale decomposition. The concept used in attacking the MDP problem is to make
use of time scales to classify the states into several groups such that the MDP jumps more
frequently within a group and less frequently among groups.
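
The grouping idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the full chain has the form (1 - eps) * blockdiag(fast_blocks) + eps * slow, where each fast block is an irreducible within-group chain and `slow` carries the rare between-group jumps. The function names and all numbers are hypothetical and are not taken from the report. Each group is collapsed to a single aggregated state, with the slow transitions averaged against the group's stationary distribution.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix P."""
    n = P.shape[0]
    # Solve nu (P - I) = 0 together with sum(nu) = 1 in the least-squares sense.
    A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    nu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return nu

def aggregate(fast_blocks, slow):
    """Collapse each group to one state, weighting slow jumps by the group's stationary law."""
    sizes = [B.shape[0] for B in fast_blocks]
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    m = len(fast_blocks)
    agg = np.zeros((m, m))
    for I, B in enumerate(fast_blocks):
        nu = stationary(B)
        rows = slow[offsets[I]:offsets[I + 1], :]
        for J in range(m):
            # Probability mass that the slow dynamics send from group I into group J.
            agg[I, J] = nu @ rows[:, offsets[J]:offsets[J + 1]].sum(axis=1)
    return agg

# Hypothetical example: four states split into two groups of two.
fast = [np.array([[0.5, 0.5], [0.2, 0.8]]),
        np.array([[0.9, 0.1], [0.3, 0.7]])]
slow = np.array([[0.10, 0.10, 0.40, 0.40],
                 [0.30, 0.30, 0.20, 0.20],
                 [0.25, 0.25, 0.25, 0.25],
                 [0.50, 0.10, 0.20, 0.20]])
nu_first = stationary(fast[0])
agg = aggregate(fast, slow)
```

The aggregated matrix has one row per group, so a decision problem posed on it has far fewer states than the original chain. Whether this averaged model is accurate enough depends on how well separated the two time scales are.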

At the lowest level, some minimal cost aircraft routes to the eventual targets are
mapped out. Some inverse Lyapunov techniques as well as optimization approaches are
used for rapid generation of these routes. These routes are then used to determine SAM
sites (possibly decoys) that are unavoidable. One then employs a discrete stochastic game
problem formulation to determine which of these SAMs should optimally be engaged, and
by what series of aircraft operations. Since this is a game model, the optimal opponent
strategy is also determined. Assuming perfect knowledge of the state of the system,
one obtains these optimal strategies (for both sides) via dynamic programming methods.
The NCSU team uses an exit cost criterion; that is, the game is solved all the
way to the various end states. Some technical shortcuts are used to allow us to obtain
this solution. Obtaining the complete solution to the problems at the lowest level in
the hierarchy is superior to the rolling horizon (euphemistically referred to as the model
predictive control) approach for obvious reasons. Further, it allows us to run Monte
Carlo simulations with optimal controls and produce plots of the expected outcome and
its variance as functions of various parameters. This allows the commander to quickly and
visually see when the situation is nearing a point where the optimal strategy makes a
sudden jump. (This will be shown below.)

Much  of  the  work  done  on  C2  for  Air  Operations  has  assumed  perfect  knowledge  of  the 
state  of  the  system  (battle).  However,  it  is  well  known  that  the  “fog  of  war”  is  a  major 
aspect  of  most  conflicts.  Consequently,  the  NCSU  team  has  recently  been  addressing 
the  problem  of  control  under  imperfect,  and  even  misleading  information.  This  involves 
formulation  of  the  problem  as  a  stochastic  game  under  partial  information.  This  is  a 
problem which is at the edge of current understanding in the field of control. We make use
of an approach which is optimal under limited conditions, and have shown that it leads
to significantly better results than the standard techniques (such as extended Kalman
filtering).

Summarizing,  we  apply  the  following  techniques: 

•  Robust/Game-Theoretic  Control  (with  stochastic  components) 

•  Robust/Game-Theoretic Estimation (and the combined estimation/control problem)

•  Hierarchical  Decomposition  Methods 

•  Inverse  Lyapunov  Techniques 

•  Dynamic  Programming. 


In Section 2, we consider our underlying “small” stochastic game problem involving
only a few aircraft, a handful of SAMs and a target. The solution of the problem at
this level underlies the solution of larger problems at higher levels. Consequently, a good
deal of work was devoted to a solid understanding of this problem. The analysis and
results are described in detail in the various subsections. In order for the aircraft to
reach their destinations, and also as a means for flagging potential threats en route which
may need to be dealt with, one needs a tool for generating reasonable aircraft routes.
This is discussed in Section 3. Once one has solved the above small games, one needs to
enlarge  the  view  to  much  more  substantial  problems  in  terms  of  the  number  of  entities 
(on  both  sides)  involved.  This  is  done  via  a  hierarchical  technique,  and  a  discussion  of 
this  technique  appears  in  Section  4.  Now,  all  of  the  above  assumes  perfect  knowledge  of 
the  system.  The  consequences  of  partial,  imperfect  and  corrupted  information  are  studied 
in  Section  5.  In  that  section,  both  the  estimation  problem,  and  the  problem  of  control 
under  imperfect  information  (in  the  presence  of  an  adversary)  are  discussed.  Section  6 
discusses  a  study  of  filtering  at  a  higher  level  in  the  hierarchy.  Lastly,  Section  7  moves 
the  discussion  to  yet  a  higher  level  where  a  commander  may  be  attempting  to  determine 
which control technique (of many being offered) to employ. The choice may depend on the
situation,  and  one  may  even  consider  a  switching  meta-controller  which  chooses  between 
different  control  algorithms  depending  on  the  current  state. 


2  Solution of Stochastic Games for C2

This section deals with the lowest level, where the problem has been reduced to a small
stochastic game involving only a few entities. (In the results to appear in this section,
there are only two aircraft, three SAMs and an enemy target; in the newer software for
the imperfect information case (Section 5), we have increased this to include at least six
SAMs and two targets as well as decoys. Although even that problem size could easily be
doubled with today’s available computational power, this study was meant only as a
demonstration of approach, and so no such effort was made. Note also that, as indicated
above, we use the hierarchical technique to deal with the full scale problem, doubling
problem size at each increasing level of the hierarchy.)

The  objects  which  will  be  of  interest  in  this  section  are  aircraft  (belonging  to  what  will 
be  termed  the  “blue”  player),  SAMs  (belonging  to  the  “red”  player),  and  strategic  targets 
(belonging to red). The usage of blue and red designations will be assumed throughout
the  document. 

We will reduce the state of the ith (blue) aircraft at time t to a pair,
$Y_i^A(t) = (D_i^A(t), X_i^A(t))$, where $D_i^A$ will represent the health status of the
aircraft, and $X_i^A$ will represent its position. Note that since the scope of the C2
problem is large, we will not model the dynamics of each aircraft in detail; we will not
include velocity, attitude, mass and so forth as part of the state. When the problem is
decomposed into separate subproblems below, we will abuse notation in the sense that in
one subproblem, $X_i^A$ will represent a position taking continuous values in
$\mathbb{R}^2$, while for the other subproblem this will indicate position among a discrete
set of alternatives; the meaning will be clear from context. We will suppose that the
health status takes values in the discrete set $\{1, 2, 3, 4\}$, where 1 represents healthy,
2 and 3 represent various levels of damage (or need of maintenance), and 4 indicates that
the aircraft has been destroyed.

We will assume similar state models for the SAMs. The ith SAM state will be represented
by the pair $Y_i^R(t) = (D_i^R(t), X_i^R(t))$, where $D_i^R$ will represent the health
status of the SAM, and $X_i^R$ will represent its position. (Note that there exist both
mobile and fixed-site SAMs.) Similar comments as those above can be made with regard to
$X_i^R$. As for the health status of the SAMs, we let $D_i^R$ take values in $\{1, 2, 3\}$,
where 1 represents healthy, 2 represents damage (or need of maintenance), and 3 indicates
that the SAM has been destroyed (not repairable). Lastly, we will take a similar model for
the strategic targets, where the pair will be denoted as
$Y_i^T(t) = (D_i^T(t), X_i^T(t))$ with $D_i^T(t) \in \{1, 2, 3\}$, where 1, 2 and 3 will
have the same meaning as for the SAMs. Let the number of blue aircraft be $N_A$, the
number of red SAMs be $N_R$, and the number of red strategic targets be $N_T$. Let
$Y^A = \{Y_i^A\}_{i=1}^{N_A}$, $Y^R = \{Y_i^R\}_{i=1}^{N_R}$ and
$Y^T = \{Y_i^T\}_{i=1}^{N_T}$. Throughout, we will use the convention of uppercase letters
for the state processes and lowercase for values that the state process may take on; that
is, $Y^A(t) = y^A$ indicates that the aircraft state process has the value $y^A$ at time t.

The objective is not clearly defined in a mathematical sense. For blue, it may sometimes
be to destroy a strategic target while minimizing damage to the aircraft; in other
situations it may be more general attrition of both SAMs and targets. In order to simplify
matters, we will assume here that both players are using the same objective function.
That is, blue is trying to minimize the worst case (maximum) payoff, and red is trying to
maximize their worst case (minimum) of the same payoff. The time-horizon over which
these objectives should be met is not fixed. We choose to consider an exit cost, without
running cost terms. Let $\tau$ be the exit time. We define the exit time to be the time
when either: 1) all the blue aircraft have been destroyed or 2) the red strategic target(s)
has (have) been destroyed and the surviving blue aircraft have returned to base. We let
the set of states satisfying one of the exit conditions be denoted by $\mathcal{E}$. In
order to capture the objective in a reasonably simple payoff function, one can consider,
for instance, a linear payoff with parameters which can be varied depending on the value
of the assets such as

$$\Psi(y^A, y^R, y^T) = \eta_A \Big[\sum_{i=1}^{N_A} d_i^A\Big] - \eta_R \Big[\sum_{i=1}^{N_R} d_i^R\Big] - \eta_T \Big[\sum_{i=1}^{N_T} d_i^T\Big] \qquad (1)$$

where $\eta_A$, $\eta_R$, $\eta_T$ are the parameters. The presence of the expectation in
the above equation is due to the fact that the dynamics of the health status of the
objects will involve random outcomes of engagements and maintenance.
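
As a small illustration of the linear exit payoff (1) and of the exit conditions, the following sketch evaluates both for a given configuration. The function names, the default weights, and the sample state are hypothetical choices made only for this example.

```python
def exit_payoff(d_aircraft, d_sams, d_targets, eta_a=1.0, eta_r=1.0, eta_t=1.0):
    """Linear payoff (1): blue minimizes, so aircraft damage raises it and red damage lowers it."""
    return eta_a * sum(d_aircraft) - eta_r * sum(d_sams) - eta_t * sum(d_targets)

def is_exit(d_aircraft, x_aircraft, d_targets, base="B"):
    """Exit set: every aircraft destroyed, or all targets destroyed with survivors back at base."""
    all_down = all(d == 4 for d in d_aircraft)
    targets_down = all(d == 3 for d in d_targets)
    survivors_home = all(x == base for d, x in zip(d_aircraft, x_aircraft) if d != 4)
    return all_down or (targets_down and survivors_home)

# Hypothetical end state: one healthy aircraft at base, one destroyed near SAM 2,
# two SAMs destroyed, one SAM healthy, and the single target destroyed.
payoff = exit_payoff([1, 4], [3, 3, 1], [3], eta_a=1.0, eta_r=1.0, eta_t=2.0)
finished = is_exit([1, 4], ["B", 2], [3])
```

Raising `eta_t` relative to `eta_a` makes destroying the target worth more aircraft damage, which is the kind of trade-off the parameters in (1) are meant to express.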

The next two subsections describe the mathematics behind the algorithm. The reader
interested primarily in the application of the algorithm and its advantages should proceed
to Subsection 2.3.

2.1  Discrete Stochastic Game


We consider the problem where a single strategic target is selected, and an approximate
path from the blue base to that target has been generated. As discussed above, there may
be one or more SAM sites intervening along this path. At this level, the positional
dynamics will be specified only in a general way. Let the SAMs be indexed as
$\{1, 2, 3, \ldots, N_R\}$. Let the aircraft position take values in the set
$\mathcal{C} = \{B, 1, 2, 3, \ldots, N_R, N_R + 1\}$, where $B$ indicates the (blue) base
and $N_R + 1$ indicates the (red) strategic target. We suppose a discrete time model where
each time step occurs only when either an aircraft engages a SAM, an aircraft engages the
target, or an aircraft returns to base. More than one such activity can occur at each
step. The aircraft control for each aircraft, $U_i^A(t)$, must be specified at each time
step. The set of possible values is $\mathcal{U} = \mathcal{C} \cup \{0\}$, where values
from $U_i^A = 1$ to $U_i^A = N_R + 1$ indicate an attack on the corresponding red SAM or
target, $U_i^A = B$ indicates return to base, and $U_i^A = 0$ indicates “do nothing”. Note
that the dynamics of the motion are simply $X_i^A(t+1) = U_i^A(t)$ when $U_i^A(t) \neq 0$
and $X_i^A(t+1) = X_i^A(t)$ when $U_i^A(t) = 0$. We place some restrictions on the
allowable controls. The control actions will be organized into cycles of length $n_c$.
That is, each cycle will consist of $n_c$ time steps. At the start of each cycle, all
aircraft must be at the base. Consequently, we require $U_i^A(t) = B$ for all
$i \le N_A$ and all $t = k n_c - 1$ for all $k \ge 1$. We also require that for any
$t \neq k n_c - 1$,

if there exists $i$ such that $X_i^A(t) = B$ and $D_i^A(t) = 1$, then there must be a
$j \le N_A$ with $D_j^A(t) \neq 4$ such that $U_j^A(t) \neq B$.   (CC)

Note that this last requirement forces at least some aircraft to engage red during each
cycle for which there is a fully healthy aircraft.
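
The positional dynamics and the restriction (CC) are simple enough to state directly in code. The sketch below is only an illustration: the function names are hypothetical, positions are encoded as the string "B" or an integer, and the check implements (CC) literally as written, leaving it to the caller to decide at which time steps the condition is applied.

```python
BASE = "B"

def control_set(n_r):
    """Admissible control values: base, SAMs 1..n_r, the target n_r + 1, and 0 for 'do nothing'."""
    return [BASE] + list(range(1, n_r + 2)) + [0]

def next_positions(x, u):
    """Each aircraft moves to its commanded location, or stays put when the control is 0."""
    return [xi if ui == 0 else ui for xi, ui in zip(x, u)]

def satisfies_cc(x, d, u):
    """(CC): if a fully healthy aircraft is at base, some non-destroyed aircraft has control != B."""
    healthy_at_base = any(xi == BASE and di == 1 for xi, di in zip(x, d))
    someone_leaves = any(di != 4 and ui != BASE for di, ui in zip(d, u))
    return (not healthy_at_base) or someone_leaves

# Hypothetical two-aircraft example with three SAMs (so the target is location 4).
controls = control_set(3)
moved = next_positions([BASE, 2], [1, 0])
```

Enumerating `control_set(n_r)` for each aircraft and filtering with `satisfies_cc` would give the set over which blue minimizes at a step where (CC) is in force.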

It will be assumed for this subproblem that the SAMs cannot move during the duration
of the game. The control for the ith SAM at (discrete) time t is $G_i^R(t)$, taking values
in $\{0, 1\}$, where 0 indicates radar off and 1 indicates radar on. As mentioned in the
introduction, when the radar is on, the probability of the SAM inflicting damage on the
aircraft rises, as does the probability that the aircraft can inflict damage on the SAM.

The health status of each of the objects will transition according to a discrete-time
Markov chain model. The transition probabilities will be state/control dependent. To
simplify matters, assume that multiple aircraft can attack a single SAM, but that the
aircraft need only engage one SAM at a time. At each time step where a SAM is under
attack, we let the transition probability be given by the matrices $P^{R01}$, $P^{R02}$,
$P^{R11}$ and $P^{R12}$, indicating the transition probabilities for the cases where a SAM
with radar off is being attacked by a single aircraft, a SAM with radar off is being
attacked by multiple aircraft simultaneously, a SAM with radar on is being attacked by a
single aircraft, and a SAM with radar on is being attacked by multiple aircraft
simultaneously, respectively. Of course, there are many more possibilities, but we
consider only these for simplicity. If a SAM radar is on, and the SAM is not under attack
during that time step, then we assume the SAM health status remains constant with
probability one. Lastly, if a SAM site is off and not under attack, the health may improve
through maintenance, with a transition probability given by $P^{R00}$. The state
$d_i^R = 3$ will be an absorbing state for all the transition matrices. In particular,
maintenance cannot repair a SAM once it has entered state 3. The transition probabilities
for the red target are the same as those for a SAM with radar off.

Let the corresponding probabilities for the aircraft during engagement be given by
$P^{A01}$, $P^{A02}$, $P^{A11}$ and $P^{A12}$, where these stand for the same situations as
those indicated for the SAMs above. We assume that the probability of transitioning to
state 4 (down) is nonzero for all of the above matrices (i.e. that the last columns have
no zero entries). We also allow the aircraft to undergo maintenance while at the base
($U_i^A(t) = B$), and let the transition probabilities be $P^{A00}$. For the aircraft, one
must also consider the possibility of damage due to flying over SAMs with radars that are
on while en route from one point to the next. For instance, if $X_i^A(t) = 1$ and
$X_i^A(t+1) = U_i^A(t) = 3$, and if SAMs 2 and 4 are between 1 and 3, then aircraft i
could suffer damage while flying over each of the SAMs 2 and 4, if they are on. We let the
transition probability for aircraft health due to flying over a SAM that is on
($G_j^R(t) = 1$) and not destroyed ($D_j^R(t) \neq 3$) be $P^{A1F}$ for each SAM that is
flown over. In the above example, if SAMs 2 and 4 are both on, SAM 3 is on, and aircraft i
is the only one attacking, then its transition probability for this time step is given by
$P^{A1F} P^{A1F} P^{A11}$. Lastly, the destroyed/down state will be absorbing for all the
transition matrices including $P^{A00}$.
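
The composition of fly-over and engagement transitions in the example above can be sketched numerically. The fly-over matrix below uses the one-parameter form described in Section 2.3; the engagement matrix `P_engage` and the value 0.1 are hypothetical numbers chosen only so that the rows sum to one, the last column has no zero entries, and state 4 is absorbing.

```python
import numpy as np

def flyover_matrix(a):
    """4x4 aircraft health matrix: each non-destroyed state jumps to state 4 with probability a."""
    P = np.diag([1.0 - a, 1.0 - a, 1.0 - a, 1.0])
    P[:3, 3] = a
    return P

def step_matrix(n_flyovers, P_fly, P_engage):
    """One time step: fly over n_flyovers active SAMs, then engage (fly-over, ..., fly-over, engage)."""
    return np.linalg.matrix_power(P_fly, n_flyovers) @ P_engage

# Hypothetical engagement matrix for a single aircraft attacking a SAM with radar on.
P_engage = np.array([[0.70, 0.15, 0.05, 0.10],
                     [0.00, 0.60, 0.25, 0.15],
                     [0.00, 0.00, 0.70, 0.30],
                     [0.00, 0.00, 0.00, 1.00]])

# Two active SAMs are overflown before the engagement, as in the example in the text.
P_step = step_matrix(2, flyover_matrix(0.1), P_engage)
loss_from_healthy = P_step[0, 3]
```

Because each factor is row-stochastic with an absorbing last state, the product is as well, so the composed matrix can be used directly as the aircraft health transition for that time step.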

Here  we  will  consider  a  simplified  information  pattern  that  is  chosen  to  mimic  the  real 
world  situation  in  a  rather  loose  way.  Specifically,  we  will  consider  the  game  where  at 
each  time  step  blue  chooses  its  control  given  the  current  state,  and  then  red  chooses  its 
control  given  the  current  state  plus  the  control  choice  for  blue  at  the  current  time.  In 
other words, we are interested here in an upper value (recall blue is minimizing and red
maximizing). Let the value function for this game be denoted by $V(y^A, y^R, y^T)$. Since it
is quite standard, we do not include a proof of the DPE (dynamic programming equation)
which is given as follows.

Theorem 2.1 The value function satisfies

$$\begin{aligned}
V(y^A, y^R, y^T)
&= \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}}
E\big\{ V(Y^A(1), Y^R(1), Y^T(1)) \;\big|\; Y^A(0) = y^A,\ Y^R(0) = y^R,\ Y^T(0) = y^T \big\} \\
&= \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}}
\mathcal{G}^{u^A, g^R}[V](y^A, y^R, y^T).
\end{aligned}$$


We will now indicate how this value function can be obtained through repeated application
of the backward dynamic programming operator. First however, we will need the
following lemma which essentially implies that there is a positive probability of reaching
the absorbing states in a fixed number of steps.

Lemma 2.2 There exists $n < \infty$ and $\delta > 0$ such that for any sequence of controls
for blue and red

$$P\big( (Y^A(t+n), Y^R(t+n), Y^T(t+n)) \in \mathcal{E} \;\big|\; (Y^A(t), Y^R(t), Y^T(t)) = (y^A, y^R, y^T) \big) \ge \delta$$

for any $(y^A, y^R, y^T)$, where we recall $\mathcal{E}$ was the exit set.

Proof. Let $t_1 = \min\{s > t : s = k n_c + 1 \text{ for some nonnegative integer } k\}$.
Then, by condition (CC), there exists $i_1 \le N_A$ such that $X_{i_1}^A(t_1) \neq B$, and
consequently, there exists $\delta_1 > 0$ (dependent on the choice of transition matrices)
such that $P(D_{i_1}^A(t_1) = 4) \ge \delta_1$. Let $\Omega_1 \subseteq \Omega$ (the sample
space) be given by $\Omega_1 = \{\omega \in \Omega : D_{i_1}^A(t_1) = 4\}$. For points in
$\Omega_1$ such that $(Y^A(t_1), Y^R(t_1), Y^T(t_1)) \notin \mathcal{E}$ (where not all the
aircraft are down), let
$t_2 = \min\{s > t_1 : s = k n_c + 1 \text{ for some nonnegative integer } k\}$. Then again
by condition (CC), there exists $i_2 \le N_A$ such that $X_{i_2}^A(t_2) \neq B$, and
consequently, $P(\{\omega \in \Omega_1 : D_{i_2}^A(t_2) = 4\}) \ge \delta_1^2$. Since state
4 is absorbing, this implies that
$P(D_{i_1}^A(t) = 4,\ D_{i_2}^A(t) = 4) \ge \delta_1^2$ for all $t \ge t_2$. Proceeding
inductively, one finds that by, at most, time $t + n_c N_A$, the state is in $\mathcal{E}$
with probability no less than $\delta_1^{N_A}$.

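Define the backward dynamic programming (DP) algorithm as follows. Let the terminal
value be

$$W(0, y^A, y^R, y^T) = \begin{cases} \Psi(y^A, y^R, y^T) & \text{if } (y^A, y^R, y^T) \in \mathcal{E}, \\ 0 & \text{otherwise.} \end{cases}$$

(We remark that the choice of 0 is irrelevant.) Given $W(k, \cdot)$, one computes
$W(k-1, \cdot)$ by the backward dynamic programming operator given by

$$\begin{aligned}
W(k-1, y^A, y^R, y^T)
&= \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}}
E\big\{ W(k, Y^A(1), Y^R(1), Y^T(1)) \;\big|\; Y^A(0) = y^A,\ Y^R(0) = y^R,\ Y^T(0) = y^T \big\} \\
&= \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}}
\mathcal{G}^{u^A, g^R}[W(k, \cdot)](y^A, y^R, y^T)
\end{aligned}$$

if $(y^A, y^R, y^T) \notin \mathcal{E}$ and
$W(k-1, y^A, y^R, y^T) = \Psi(y^A, y^R, y^T)$ otherwise.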
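
The backward DP recursion can be sketched on a toy problem. The following is a minimal illustration and not the NCSU software: the state space is flattened to a single index, `P[u, g]` is a hypothetical row-stochastic matrix for each blue control `u` and red control `g`, and the iteration simply runs forward in a loop rather than indexing k downward from 0. Red maximizes after seeing blue's control, matching the upper-value information pattern.

```python
import numpy as np

def backward_dp(P, psi, exit_mask, tol=1e-10, max_iter=10_000):
    """Iterate W <- min_u max_g E[W(next state)], holding W = psi on the exit set."""
    W = np.where(exit_mask, psi, 0.0)          # terminal value: psi on exit states, 0 elsewhere
    for _ in range(max_iter):
        Q = P @ W                               # Q[u, g, y] = E[W(next) | y, u, g]
        W_new = Q.max(axis=1).min(axis=0)       # inner max over red g, outer min over blue u
        W_new = np.where(exit_mask, psi, W_new)
        if np.max(np.abs(W_new - W)) < tol:
            return W_new
        W = W_new
    return W

# Hypothetical toy game: state 0 is in play, states 1 and 2 are exit states.
P = np.zeros((2, 2, 3, 3))
P[:, :, 1, 1] = 1.0                             # exit states are absorbing
P[:, :, 2, 2] = 1.0
P[0, 0, 0] = [0.5, 0.1, 0.4]
P[0, 1, 0] = [0.5, 0.3, 0.2]
P[1, 0, 0] = [0.2, 0.3, 0.5]
P[1, 1, 0] = [0.2, 0.5, 0.3]
psi = np.array([0.0, 2.0, -1.0])                # exit payoffs (entry 0 is unused)
exit_mask = np.array([False, True, True])

W = backward_dp(P, psi, exit_mask)
```

In this toy game every control pair leaves state 0 with probability at least 0.5 per step, which plays the role of the positive exit probability in Lemma 2.2 and is what makes the iteration settle geometrically.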

Lemma 2.3 This backward dynamic programming propagation operator is a contraction.

Proof. Once one has Lemma 2.2, the proof of this lemma is a minor variation of
standard results, but in this case for a game with an exit criterion. (See, for instance, [4],
[15] for similar results.) We will simply indicate some of the main points. Let $W_1$ and
$W_2$ be given by the backward DP with possibly different conditions at $k = 0$. For
simplicity, use the notation $y = (y^A, y^R, y^T)$. Note that (for $k < 0$)

$$W_1(k, y) - W_2(k, y)
= \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}} \mathcal{G}^{u^A, g^R}[W_1(k+1, \cdot)](y)
- \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}} \mathcal{G}^{u^A, g^R}[W_2(k+1, \cdot)](y).$$

Choose $\bar u^A$ to be $\varepsilon/n$-optimal for $W_2$ and then choose $\bar g^R$ to be
$\varepsilon/n$-optimal for $W_1$ given the same control $\bar u^A$ as used for $W_2$. Then

$$W_1(k, y) - W_2(k, y) \le \mathcal{G}^{\bar u^A, \bar g^R}[W_1(k+1, \cdot) - W_2(k+1, \cdot)](y) + \frac{2\varepsilon}{n}.$$

Repeating this process, one finds that (for $k \le -n$ and proper choice of feedback
controls)

$$W_1(k, y) - W_2(k, y) \le \prod_{m=1}^{n} \mathcal{G}^{\bar u_m^A, \bar g_m^R}[W_1(k+n, \cdot) - W_2(k+n, \cdot)](y) + 2\varepsilon$$

where we are using the $\prod$ notation to indicate operator composition. Alternatively,
one may write this as

$$W_1(k, y) - W_2(k, y) \le \sum_{z} \Big\{ [W_1(k+n, z) - W_2(k+n, z)] \cdot P_{y,z}\big(\{\bar u_m^A\}_{m=1}^{n}, \{\bar g_m^R\}_{m=1}^{n}\big) \Big\} + 2\varepsilon$$

where this last term indicates the probability of transitioning from $y$ to $z$ in $n$ steps
given the feedback control processes specified in the arguments. Using symmetry and
Lemma 2.2 (and noting that $W_1$ and $W_2$ agree on $\mathcal{E}$), one obtains

$$\begin{aligned}
|W_1(k, y) - W_2(k, y)|
&\le \max_{z} |W_1(k+n, z) - W_2(k+n, z)| \sum_{z \notin \mathcal{E}} P_{y,z}\big(\{\bar u_m^A\}_{m=1}^{n}, \{\bar g_m^R\}_{m=1}^{n}\big) + 2\varepsilon \\
&\le \max_{z} |W_1(k+n, z) - W_2(k+n, z)| \, (1 - \delta) + 2\varepsilon.
\end{aligned}$$

Since $\varepsilon$ is arbitrary, this then yields

$$\|W_1(k, \cdot) - W_2(k, \cdot)\|_\infty \le (1 - \delta) \, \|W_1(k+n, \cdot) - W_2(k+n, \cdot)\|_\infty.$$


The  proof  of  convergence  is  also  standard,  and  so  we  state  the  result  without  proof. 


Theorem 2.4 $W(k, y^A, y^R, y^T)$ converges to the value function $V(y^A, y^R, y^T)$ as
$k \downarrow -\infty$ for all points in the state space.

We remark that since the control spaces are finite, the controls actually converge in
a finite number of steps.


2.2  Reducing  the  Computations 


The above algorithm for the computation of the value function (and corresponding control
policies) suffers from the curse of dimensionality typical for DP algorithms. Specifically,
notice that computation of $\mathcal{G}^{u^A, g^R}[W(k, \cdot)](y)$ may require summing the
product of $W(k, z)$ and $P^{u^A, g^R}_{y,z}$ over all possible values of $z$ for each point
$y$. More specifically, the computations for $W(k-1, y)$ (for each $y$) require
$O\big(4^{N_A} (N_R + N_T + 1)^{N_A} 3^{N_R} 3^{N_T}\big)$ operations, even without
optimization over blue and red control policies. We will discuss one of the methods being
used to reduce these computation costs. The method will involve an approximation of $W$ at
each step. The result will be that the computational costs per $y$ point will be reduced
from the above exponential growth in the number of dimensions to only linear growth in the
number of dimensions. This is a tremendous reduction in computational costs which makes
the difference between feasibility and infeasibility of computation for low-dimensional
problems. The growth in the number of points at which we must evaluate $W$ remains
exponential in the number of dimensions of course.

We introduce the following operator which is essentially an approximation operator for
the value function or DP iterates around any given point $y$. In order to reduce the
notation, we will consider a simplified state space where $y = (y_1, y_2, y_3)$ with
$y_1 \in \{1, 2, 3, 4\}$ and $y_2, y_3 \in \{1, 2, 3\}$. This will reduce notation without
losing the flavor of the method. Define the matrices $A^i$ for $i = 1, 2, 3$ given by
$A^i_{jk} = 1$ if $j = k = i$ and $A^i_{jk} = 0$ otherwise. Then, given $y$, define the
approximation operator for approximation around $y$ by

$$\mathcal{H}^y[V(\cdot)](z) = \begin{cases}
\dfrac{1}{\sum_{i=1}^{3} |z_i - y_i|} \displaystyle\sum_{i=1}^{3} |z_i - y_i| \, V\big(A^i (z - y) + y\big) & \text{if } z \neq y, \\[2ex]
V(y) & \text{if } z = y.
\end{cases}$$

The operator is essentially an approximation operator where convex combinations are
used to approximate $V$ for states which are not directly along a basis direction from the
point around which $V$ is being approximated. Although we will not discuss the error
analysis here, we note that of course the appropriateness of an approximator of this form
depends critically on the nature of the value function itself which, in turn, depends on
the choice of terminal payoff, $\Psi$. Recall that since the problem is rather loosely
defined, we have great freedom in the choice of $\Psi$. Now, note that the approximation
operator is a nonexpansive map for any $y$. The backward DP operator of the previous
section will now be replaced by the approximate backward DP operator given by

$$W(k-1, y) = \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}} \mathcal{G}^{u^A, g^R}\big[\mathcal{H}^y[W(k, \cdot)](\cdot)\big](y)$$

if $y \notin \mathcal{E}$ and $W(k-1, y) = \Psi(y)$ otherwise. Using the nonexpansivity of
this approximation operator and the contraction property of the backward DP, one can
obtain the following result in a straightforward manner similar to that of the previous
section.

Theorem 2.5 The approximate backward DP operator is a contraction, and the corresponding
iterates converge to a fixed point of the operator.

Lastly, we indicate the promised reduction in computation via the approximation.
Recall that each of the transitions is independent. Suppose for this simplified problem
that the transition matrices for $y_1, y_2, y_3$ are given by $P^1, P^2, P^3$ respectively,
where we are suppressing the dependence of each $P$ on the states and controls. (Note that
in this simplified problem, we have actually eliminated the position state for the
aircraft.) Then the approximate backward DP takes the form

$$\begin{aligned}
W(k-1, y) = \min_{u^A \in \mathcal{U}^{N_A}} \max_{g^R \in \{0,1\}^{N_R}} \Big\{
& W(k, y) \, P^1_{y_1, y_1} P^2_{y_2, y_2} P^3_{y_3, y_3} && (2) \\
& + \sum_{z_1 = 1}^{4} W(k, z_1, y_2, y_3) \, Q^1_{|z_1 - y_1|} \, P^1_{y_1, z_1} \\
& + \sum_{z_2 = 1}^{3} W(k, y_1, z_2, y_3) \, Q^2_{|z_2 - y_2|} \, P^2_{y_2, z_2} \\
& + \sum_{z_3 = 1}^{3} W(k, y_1, y_2, z_3) \, Q^3_{|z_3 - y_3|} \, P^3_{y_3, z_3} \Big\} && (3)
\end{aligned}$$

where

$$Q^1_{|z_1 - y_1|} = \sum_{z_2 = 1}^{3} \sum_{z_3 = 1}^{3} \frac{|z_1 - y_1| \, P^2_{y_2, z_2} P^3_{y_3, z_3}}{\sum_{i=1}^{3} |z_i - y_i|}$$

with analogous definitions for $Q^2_{|z_2 - y_2|}$ and $Q^3_{|z_3 - y_3|}$ (terms with
$z = y$, where numerator and denominator both vanish, are taken to be zero). Note that
these $Q^i$ may be precomputed. Thus the approximate DP (3) has only linear growth in the
computations which must be performed at each step (per point in the state space).
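
The equivalence between the factored form (3) and the full expectation of the approximated value can be checked numerically. The sketch below is only such a check, under stated assumptions: the function names are hypothetical, indices are 0-based, the transition matrices and `W` are random test data, and the min-max over controls is omitted so that a single fixed control pair is compared.

```python
import itertools
import numpy as np

def direct_expectation(W, y, Ps):
    """Sum over every successor z of P(y -> z) times the approximation of W at z around y."""
    total = 0.0
    for z in itertools.product(*(range(n) for n in W.shape)):
        prob = np.prod([P[yi, zi] for P, yi, zi in zip(Ps, y, z)])
        diffs = [abs(zi - yi) for zi, yi in zip(z, y)]
        s = sum(diffs)
        if s == 0:
            h = W[y]
        else:
            h = sum(d * W[y[:i] + (z[i],) + y[i + 1:]] for i, d in enumerate(diffs)) / s
        total += prob * h
    return total

def q_weights(y, Ps, i, shape):
    """Q^i as a vector indexed by z_i: the other coordinates are summed out ahead of time."""
    others = [j for j in range(len(shape)) if j != i]
    Q = np.zeros(shape[i])
    for zi in range(shape[i]):
        d = abs(zi - y[i])
        if d == 0:
            continue                            # zero weight when coordinate i does not move
        for zo in itertools.product(*(range(shape[j]) for j in others)):
            prob = np.prod([Ps[j][y[j], zj] for j, zj in zip(others, zo)])
            s = d + sum(abs(zj - y[j]) for j, zj in zip(others, zo))
            Q[zi] += prob * d / s
    return Q

def factored_expectation(W, y, Ps, Qs):
    """Form (3): the stay-put term plus one one-dimensional sum per coordinate."""
    total = W[y] * np.prod([P[yi, yi] for P, yi in zip(Ps, y)])
    for i, (P, Q) in enumerate(zip(Ps, Qs)):
        for zi in range(W.shape[i]):
            total += W[y[:i] + (zi,) + y[i + 1:]] * Q[zi] * P[y[i], zi]
    return total

rng = np.random.default_rng(0)

def random_stochastic(n):
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

shape = (4, 3, 3)
Ps = [random_stochastic(n) for n in shape]
W = rng.random(shape)
y = (1, 0, 2)
Qs = [q_weights(y, Ps, i, shape) for i in range(3)]
direct = direct_expectation(W, y, Ps)
factored = factored_expectation(W, y, Ps, Qs)
```

Once the weights are available, the factored evaluation touches 4 + 3 + 3 values of W instead of all 36 successors, which is the linear versus exponential growth described above. The weights depend on y and on the controls through the transition matrices, so in practice they would be tabulated per case.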
2.3  Testbed, Monte Carlo Simulation and Landscape Plots


The game controller was tested via Monte Carlo simulation. The first purpose was to
ensure that the game controller was bug-free (it wasn’t). The second was to explore the
structure  of  the  results,  and  their  dependencies  on  system  parameters. 

For the first set of tests, a simple geometry was considered. This is depicted in Figure
1. A corridor has been determined. The corridor is such that one must pass within
the umbrella of each SAM to get to the next, and finally to the target. The SAMs are
numbered 1 to 3 from bottom to top. When there are fewer than three SAMs in some of the
tests, that may correspond to any of the SAMs in Figure 1a being missing. (The problem
is mathematically independent of which specific ones are missing in this geometry.)

A  number  of  bugs  were  found  and  removed.  Most  notably,  a  problem  where  the 
controller  was  not  bookkeeping  the  number  of  aircraft  acting  in  tandem  properly  was 
corrected. 

It  was  soon  noticed  that  the  most  significant  feature  of  the  controllers  was  the  choice  for 
blue  of  whether  to  fly-over  the  SAMs  without  attacking  or  to  perform  a  rollback  policy 
(removing  the  first,  then  the  second,  then  the  third  before  the  target).  Intermediate 
policies  occurred  only  for  a  small  range  of  cases. 

Based  on  feedback  from  the  program  office,  it  was  decided  to  use  the  Monte  Carlo 
simulator  to  look  at  dependency  on  certain  parameters,  and  the  effect  of  mismodeling  of 
those  parameters.  This  study  was  undertaken  with  a  software  package  which  was  referred 
to  as  the  Sensitivity  Tool.  It  varied  both  actual  parameters  in  the  simulation  and  the 
corresponding  values  assumed  by  the  controller.  For  each  such  possibility,  a  Monte  Carlo 
series  was  run  (generally  with  approximately  1000-2000  sample  games).  The  results  were 
plotted  in  three-dimensional  figures  where  the  horizontal  axes  corresponded  to  true  and 
assumed values of various parameters. The Sensitivity Tool and the embedded Monte
Carlo simulator are depicted in Figures 1b and 1c.


It  was  found  that  the  choice  of  fly-over  or  rollback  was  sensitive  to  the  probability  of 
damage  to  a  blue  aircraft  when  flying  through  the  SAM  umbrella  without  engaging.  Let 
this  probability  be  denoted  by  a.  (More  specifically,  the  transition  probabilities  for  the 
aircraft  are  4x4  matrices,  and  the  (1, 4),  (2,  4),  (3, 4)  entries  are  a,  the  (1, 1),  (2,  2),  (3,  3) 
entries  are  1  —  a,  the  (4,4)  is  1,  and  the  rest  are  zero.)  Consequently,  Monte  Carlo  runs 
were  done  where  the  true  value  of  a  and  the  value  the  controllers  believe  to  be  a  were 
varied.  For  most  of  the  runs,  both  blue  and  red  had  the  same  value  of  a.  This  was  done 
for  computational  reasons.  A  small  number  of  runs  were  done  where  red  had  the  true 
value,  and  so  only  blue  had  an  incorrect  value.  These  verified  that  the  main  structure 
we  will  see  is  due  to  the  blue  controller.  (The  runs  where  red  had  the  true  value  will  be 
discussed  later.) 
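
The matrix structure described in the parenthetical above can be sketched as follows. This is only an illustration, assuming NumPy; the function name `flyover_transition_matrix` is hypothetical, and the reading of the fourth state as an absorbing "down" state is an assumption based on the (4,4) entry being 1.

```python
import numpy as np

def flyover_transition_matrix(a: float) -> np.ndarray:
    """4x4 aircraft transition matrix for a fly-over without engaging.

    Rows are the current state, columns the next state (1-based in the text,
    0-based here). The fourth state is treated as absorbing (an assumption).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be a probability in [0, 1]")
    P = np.zeros((4, 4))
    for s in range(3):
        P[s, s] = 1.0 - a   # (1,1), (2,2), (3,3) entries: state unchanged
        P[s, 3] = a         # (1,4), (2,4), (3,4) entries: jump to state 4
    P[3, 3] = 1.0           # (4,4) entry: state 4 is absorbing
    return P

P = flyover_transition_matrix(0.014)
print(P.sum(axis=1))        # every row sums to one
```

Varying the argument separately for the simulator and for the controller gives the "true" and "control" values of a used in the runs.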
Fig. 1a. Geography 1 distillation (Blue base, SAM corridor, target).

Fig. 1b. Sensitivity Tool for Small Games. (Corresponding programs indicated in brown.)

Fig. 1c. Monte Carlo for Small Games. (Corresponding programs indicated in brown.)

The Monte Carlo data to follow were generated with the following parameter values. For
notation, see the "pmrmodel" documentation (in ps or pdf format).

$\mu_A = 20$, $\mu_R = 3$, $\mu_T = 20$

pda0 = [1, 0, 0, 0; 0.4, 0.6, 0, 0; 0.3, 0, 0.7, 0; 0, 0, 0, 1]

pda011 = [0.99, 0.00, 0.00, 0.01; 0.0, 0.95, 0.025, 0.025; 0.0, 0.0, 0.9, 0.1; 0.0, 0.0, 0.0, 1.0]

pda021 = [1.0, 0.0, 0.0, 0.00; 0.00, 0.98, 0.01, 0.01; 0.00, 0.00, 0.96, 0.04; 0.00, 0.00, 0.00, 1.00]

pda111 = [0.82, 0.01, 0.02, 0.15; 0.00, 0.82, 0.03, 0.15; 0.00, 0.00, 0.75, 0.25; 0.00, 0.00, 0.00, 1.00]

pda121 = [0.92, 0.01, 0.02, 0.05; 0.00, 0.92, 0.03, 0.05; 0.00, 0.00, 0.87, 0.13; 0.0, 0.0, 0.0, 1.0]

pdadr3 = [0.995, 0.000, 0.00, 0.005; 0.00, 0.98, 0.01, 0.01; 0.00, 0.00, 0.97, 0.03; 0.0, 0.0, 0.0, 1.0]

pdfoon = (see above)

pdr0 = [1, 0, 0; 0.3, 0.7, 0; 0, 0, 1]

pdr011 = [0.7, 0.0, 0.3; 0.0, 0.6, 0.4; 0.0, 0.0, 1.0]

pdr021 = [0.4, 0.0, 0.6; 0.0, 0.35, 0.65; 0.0, 0.0, 1.0]

pdr111 = [0.7, 0.0, 0.3; 0.0, 0.5, 0.5; 0.0, 0.0, 1.0]

pdr121 = [0.3, 0.0, 0.7; 0.0, 0.25, 0.75; 0.0, 0.0, 1.0]

For  the  purposes  of  reader  understanding,  the  output  value  has  been  multiplied  by  —1 
and  had  a  constant  added  so  that  it  takes  the  form 

$$v = \mu_A \sum_i (3 - A_i) - \mu_R \sum_i (R_i - 2) - \mu_T (D_T - 2).$$

In this case, the minimum value is 0 (rout for blue), and the maximum value is
$3\mu_A n_A + 2\mu_R n_R + 2\mu_T$, where $n_A$ and $n_R$ are the numbers of aircraft and SAMs. Note that in this
case, larger numbers are better for blue and worse for red.
Fig. 2.3.1: MC experiment, flyover-risk sensitivity: game value.


All runs are made with 2000 sample points unless otherwise specified. Each set of
figures takes from 2 to 16 hours to produce on a typical workstation. The controllers are
computed 11 times (once for each a), and 2000 simulation runs are made for each of the 121
points on the graph.
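
The loop structure of such a sweep can be sketched as below. This is not the Sensitivity Tool itself: `sensitivity_sweep`, `toy_controller` and `toy_game` are hypothetical names, and the toy policy switch and payoffs are invented purely so that the sketch runs. What it does reflect from the text is that one controller is computed per assumed value and a batch of sample games is played for every (true, assumed) pair.

```python
import numpy as np

def sensitivity_sweep(values, make_controller, play_game, n_games=2000, seed=0):
    """Sample mean and s.d. of the payoff on a (true value) x (assumed value) grid."""
    rng = np.random.default_rng(seed)
    n = len(values)
    mean = np.zeros((n, n))
    sd = np.zeros((n, n))
    for j, assumed in enumerate(values):
        controller = make_controller(assumed)       # computed once per assumed value
        for i, true in enumerate(values):
            payoffs = np.array([play_game(controller, true, rng)
                                for _ in range(n_games)])
            mean[i, j] = payoffs.mean()
            sd[i, j] = payoffs.std(ddof=1)
    return mean, sd

# Hypothetical stand-ins for the real controller and simulator.
def toy_controller(assumed_a, switch=0.05):
    return "flyover" if assumed_a < switch else "rollback"

def toy_game(policy, true_a, rng):
    if policy == "flyover":
        return 1.0 if rng.random() >= true_a else 0.0   # survives with prob. 1 - a
    return 1.0 if rng.random() >= 0.1 else 0.0          # made-up rollback outcome

grid = np.linspace(0.0, 0.1, 11)                        # 11 x 11 = 121 grid points
mean, sd = sensitivity_sweep(grid, toy_controller, toy_game, n_games=200)
print(mean.shape)                                       # (11, 11)
```

Plotting `mean` and `sd` as surfaces over the two axes gives landscape plots of the kind described in this subsection.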


The  first  set  of  data  is  for  the  case  of  one  aircraft  and  one  SAM  site. 

•  Figure 2.3.1 is the sample mean value.

•  Figure  2.3.2  is  the  sample  standard  deviation. 

•  Figure  2.3.3  is  the  sample  mean  number  of  surviving  a/c 

•  Figure  2.3.4  is  the  sample  mean  number  of  surviving  SAMs 

•  Figure  2.3.5  is  the  sample  mean  number  of  surviving  targets 

•  Figure  2.3.6  is  the  sample  mean  number  of  cycles  until  the  end  of  the  game. 

(Recall  that  the  game  ends  when  all  the  blue  a/c  are  down,  or  the  red  target  is 
destroyed.) 
Fig. 2.3.2: MC experiment, flyover-risk sensitivity: game value (s.d.). Horizontal axes: actual Pk, control Pk.

Fig. 2.3.3: MC experiment, flyover-risk sensitivity: avg. surviving A/C. Horizontal axes: actual p, control p.

Fig. 2.3.4: MC experiment, flyover-risk sensitivity: avg. surviving SAMs. Horizontal axes: actual p, control p.

Fig. 2.3.5: MC experiment, flyover-risk sensitivity: avg. surviving targets.

Fig. 2.3.6: MC experiment, flyover-risk sensitivity: avg. game time.


Note: 

1.  The  value  has  two  “regions”.  In  the  left  region  it  is  roughly  linear  (over  the  region), 
and  in  the  right  region  it  is  constant. 

2.  The  value  is  monotonically  decreasing  as  a  function  of  the  true  a. 

3.  For  each  line  of  constant  true  a,  the  value  takes  on  its  maximum  at  the  same  value 
of  control  a.  (The  blue  controller  effect  dominates.) 

4.  The  right  side  corresponds  to  rollback,  the  left  to  fly-over. 

5.  The  standard  deviation  for  the  rollback  is  higher  than  that  of  the  fly-over  even  when 
the  mean  value  is  lower. 

6.  The  average  number  of  surviving  units  is  constant  on  the  right  region. 
Fig. 2.3.7: MC experiment, flyover-risk sensitivity: game value.


The  second  set  of  data  is  for  the  case  of  two  aircraft  and  one  SAM  site. 

•  Figure  2.3.7  is  the  sample  mean  value. 

•  Figure  2.3.8  is  the  sample  standard  deviation. 

•  Figure  2.3.9  is  the  sample  mean  number  of  surviving  a/c 

•  Figure  2.3.10  is  the  sample  mean  number  of  surviving  SAMs 

•  Figure  2.3.11  is  the  sample  mean  number  of  surviving  targets 

•  Figure  2.3.12  is  the  sample  mean  number  of  cycles  until  the  end  of  the  game. 
Fig. 2.3.8: MC experiment, flyover-risk sensitivity: game value (s.d.). Horizontal axes: actual Pk, control Pk.

Fig. 2.3.9: MC experiment, flyover-risk sensitivity: avg. surviving A/C.

Fig. 2.3.10: MC experiment, flyover-risk sensitivity: avg. surviving SAMs.

Fig. 2.3.11: MC experiment, flyover-risk sensitivity: avg. surviving targets.

Fig. 2.3.12: MC experiment, flyover-risk sensitivity: avg. game time.


Note: 

1.  The  value  still  has  two  “regions”.  In  the  left  region  it  is  roughly  linear  (over  the 
region),  and  in  the  right  region  it  is  constant. 

2.  The  switch-over  point  from  fly-over  to  rollback  is  different. 

3.  The  structures  of  all  figures  are  the  same  as  for  the  previous  data  set. 


The  third  set  of  data  is  for  the  case  of  two  aircraft  and  three  SAM  sites. 

•  Figure  2.3.13  is  the  sample  mean  value. 

•  Figure  2.3.14  is  the  sample  standard  deviation. 

•  Figure  2.3.15  is  the  sample  mean  number  of  surviving  a/c 

•  Figure  2.3.16  is  the  sample  mean  number  of  surviving  SAMs 

•  Figure  2.3.17  is  the  sample  mean  number  of  surviving  targets 

•  Figure  2.3.18  is  the  sample  mean  number  of  cycles  until  the  end  of  the  game. 
Fig. 2.3.13: MC experiment, flyover-risk sensitivity: game value.

Fig. 2.3.14: MC experiment, flyover-risk sensitivity: game value s.d.

Fig. 2.3.15: MC experiment, flyover-risk sensitivity: avg. surviving A/C.

Fig. 2.3.16: MC experiment, flyover-risk sensitivity: avg. surviving SAMs.

Fig. 2.3.17: MC experiment, flyover-risk sensitivity: avg. surviving targets.

Fig. 2.3.18: MC experiment, flyover-risk sensitivity: avg. game time.
Note:


1.  The  value  still  has  two  “regions”.  In  the  left  region  it  is  roughly  linear  (over  the 
region),  and  in  the  right  region  it  is  constant.  Again  the  structure  is  essentially  the 
same. 

2.  The  data  is  noisier. 

3.  The  switch-over  point  from  fly-over  to  rollback  is  different. 

4.  A  close  examination  of  Figure  2.3.16  (surviving  SAMs)  indicates  that  the  switch-over 
is  not  quite  complete  from  one  control  a  to  the  next.  A  little  bit  of  the  switch-over 
is  not  complete  until  after  two  steps.  Actually,  from  other  runs,  we  have  seen  that 
there  is  a  reduction  from  complete  fly-over  to  rollback  in  several  stages,  but  the 
change  is  so  rapid  (as  this  parameter  changes)  that  to  a  good  approximation,  it  is 
a  single  switching  point. 

5.  Note  that  in  Figure  2.3.13,  the  value  may  not  quite  have  its  maximum  relative  to 
the  line  true  a  =  0.014  at  control  a  =  0.014.  This  may  represent  some  small  error 
due  to  approximations  made  in  the  control  computations. 
Fig. 2.3.19: Geography 2 distillation.

Different Geographies


The controller can deal with more complicated geometries than that of the previous
data sets. We have run it with the geographical distillation depicted in Figure 2.3.19.
The geography is distilled into a file which records the red sites whose umbrellas must be
flown through in going from site A to site B. The controller has been tested for
this example geometry as well. The landscape plots for these geometries are similar to
those above and are not included.
2.4  Differing SAM lethalities


Previously, there was only one type of SAM in our Small Game Controller and Testbed.
In this subsection, we indicate the changes that were made to allow the SAMs to have
both different effective radii and different lethality (strengths). The possibility of different
strength SAMs had a significant effect on the shape of the value landscape and the optimal
controls.

The different radii are an off-line matter which is eliminated in the geography
distillation file.

The different lethality has been added to both the control software and the Small Game
testbed. In the previous subsection, the optimal Blue strategy was essentially always
either fly-over or rollback. With multiple SAM strengths, the number of possibilities
increases. With two SAM strengths, there are three control regions for Blue (fly-over,
partial rollback, rollback). See the figures on the following pages for an example with two a/c,
two SAMs and a target. The SAMs had different lethalities.
Fig. 2.4.1: Two SAM lethalities, Game Values


An obvious question is whether the two control policy jumps in the previous figures were
due to two SAM strengths or to two SAMs. Figures 2.4.7-2.4.9 are for 2 a/c and 3 SAMs
where the SAMs have two strength types. Two SAMs are weaker and one is stronger while
under direct attack; the fly-over damage was kept the same for both types in this example. Note
that there are only two policy jumps, indicating that the effect is related to the
number of types of SAMs, not the number of SAMs.
Fig. 2.4.2: Two SAM lethalities, Sample S.D. of Game Values. Horizontal axes: actual Pk, control Pk.

Fig. 2.4.3: Two SAM lethalities, Average remaining aircraft

Fig. 2.4.4: Two SAM lethalities, Average remaining SAMs. Horizontal axes: actual Pk, control Pk.

Fig. 2.4.5: Two SAM lethalities, Average remaining Targets

Fig. 2.4.6: Two SAM lethalities, Average game time (cycles)


The major result here is that one does not need to search over all of the a/c control space.
The search appears to be reduced to only n + 1 (or maybe 2n) policies for blue, where n is
the number of SAM lethality types. (Differing radii of coverage do not affect this.) This
implies a very large computational savings, so that much larger problems become solvable
at this low level in the hierarchy.
Fig. 2.4.7: Two SAM lethalities 3 SAMs, Average remaining aircraft

Fig. 2.4.8: Two SAM lethalities 3 SAMs, Average remaining SAMs

Fig. 2.4.9: Two SAM lethalities 3 SAMs, Average remaining Targets

2.5  Incorrect  Assumptions  of  System  Parameters 

In the above subsections, the controllers for both the Blue and the Red players were
computed from the same minimax code under the same assumptions. This yields the
worst-case opponent from the Blue perspective. A modified version of the Small Game
testbed was produced where the controllers do not operate under the same assumptions.
Both the model parameters (transition probabilities) and the payoff functions that each
player uses may differ from those used by the other player.

A small number of results were generated. In the examples from Subsection 2.3, Blue
and Red both assume a value of $\mu_R = 3$ for the SAMs. (Note: $\mu_A = 20$, $\mu_T = 20$.) Here,
however, we modify this so that Blue assumes $\mu_R = 3$ and Red assumes $\mu_R = 20$. Results
were plotted according to the Blue payoff function.

The case was for 2 a/c and 2 SAMs. This leads to control lookup tables of approximately
13,000 lines. The tables used different Red controls in approximately 2,000 lines
(of the 13,000). However, the outcomes were not significantly affected for this pair of
different assumptions. The new payoff is depicted in Figure 2.5.1, and the difference
between this value and that for the case where both assume $\mu_R = 3$ is depicted in Figure
2.5.2. The point is that in this particular example, although many control strategy lines
were changed, the critical lines were not changed.
Fig. 2.5.1: Red's $\mu_R = 20$ (not same as Blue), No. of iterations = 500. MC experiment, flyover-risk sensitivity: game value.

Fig. 2.5.2: Difference in value functions (Red and Blue with different plant models: game value difference).


2.6  Observation  Delays 

We  did  a  small  study  of  the  effects  of  observation  delays.  For  this  study,  we  assumed  that 
the  observations  of  the  state  are  perfect,  but  are  delayed  by  some  fixed  amount  of  time. 
In  the  experiment,  both  players  experienced  the  same  informational  delays.  The  results 
in  Figures  2.6  are  for  the  case  of  one  aircraft  and  one  SAM. 
Fig. 2.6.2: Standard Deviation

Fig. 2.6.3: Average remaining aircraft

Fig. 2.6.4: Average remaining SAMs

Fig. 2.6.5: Average remaining targets
Since we have delay in both controllers, one does not see a monotone decay in the
value as a function of delay. However, the delay does tend to have a more deleterious
effect on Blue. In the fly-over region, the delay has negligible effect. Note finally that
the length of the game tends to increase.
3  Optimal and Near-Optimal Route Generation

The first step taken is to find the optimal routes to the target(s) for the aircraft while
avoiding the SAMs. First we discuss the control problem formulation.


3.1  Control  Problem  Formulation 


We formulate the optimal routing problem from the base $x^A = (x_1^A, x_2^A)$ to the target
$x^T \in R^2$ as an optimal (exit) control problem:

$$\min \int_0^{\tau} \Big(1 + \sum_{i=1}^{N_R} \sigma_i\, \ell_i(X^A - x_i^R)\Big)\, dt \qquad (3.1)$$

subject to $X^A(0) = x^A$, $X^A(\tau) = x^T$ and

$$\frac{d}{dt} X^A(t) = u(t), \quad |u(t)| \le 1, \qquad (3.2)$$

where $X^A(t) \in R^2$ is the location of an aircraft, $u(t) = (u_1(t), u_2(t))$ is the velocity
control, and $x_i^R$ is the $i$th opponent SAM site with strength $\sigma_i$, $1 \le i \le N_R$. The loss
function $\ell_i$ represents a loss due to flying close to the site $i$, and for example $\ell_i(x^A - x_i^R) =$
i^A~R-y  That  is,  the  optimal  route  is  determined  so  that  the  sum  of  time  and  total  loss 

is  minimized.  The  optimal  control  of  (3.1)  is  given  by  the  feedback  law  [14] 


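$$u(t) = -\frac{V_{x^A}(X^A(t))}{|V_{x^A}(X^A(t))|} \qquad (3.3)$$

where the value function $V$ satisfies the Hamilton-Jacobi-Bellman equation

$$-|V_{x^A}(x^A)| + \ell(x^A) = 0, \quad V(x^T) = 0, \quad \text{with } \ell(x^A) = 1 + \sum_{i=1}^{N_R} \sigma_i\, \ell_i(x^A - x_i^R). \qquad (3.4)$$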


Next, we describe the numerical method for the HJ equation (3.4). Let $V_{i,j}$ denote an
approximation of $V$ at each grid point $(x_i, y_j)$ of a uniform grid over a square
domain $\Omega$ in $R^2$. Let $h > 0$ be the stepsize, and define the backward and forward
differences

$$(D_x^-)_{i,j}V = \frac{V_{i,j} - V_{i-1,j}}{h}, \qquad (D_x^+)_{i,j}V = \frac{V_{i+1,j} - V_{i,j}}{h},$$

$$(D_y^-)_{i,j}V = \frac{V_{i,j} - V_{i,j-1}}{h}, \qquad (D_y^+)_{i,j}V = \frac{V_{i,j+1} - V_{i,j}}{h}.$$

We use the upwinding method of Godunov to discretize (3.4):

$$\big[\max\big((D_x^-)_{i,j}V,\, -(D_x^+)_{i,j}V,\, 0\big)\big]^2 + \big[\max\big((D_y^-)_{i,j}V,\, -(D_y^+)_{i,j}V,\, 0\big)\big]^2 = \ell_{i,j}^2, \qquad (3.5)$$

where $\ell_{i,j} = \ell(x_i, y_j)$.


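We employ the fixed point iterate [36]: let $V^n_{i,j}$ denote the $n$-th iterate, and we update
$V^{n+1}_{i,j}$ by solving (3.5) for $V^{n+1}_{i,j}$ at each grid point, given $V^n_{i-1,j}$, $V^n_{i+1,j}$, $V^n_{i,j-1}$ and $V^n_{i,j+1}$. The
exact step is given as

$$a_{i,j} = \min(V^n_{i-1,j}, V^n_{i+1,j}), \quad b_{i,j} = \min(V^n_{i,j-1}, V^n_{i,j+1}), \quad s_{i,j} = \frac{\ell_{i,j}^2 h^2}{2} - \Big(\frac{a_{i,j} - b_{i,j}}{2}\Big)^2,$$

$$V^{n+1}_{i,j} = \begin{cases} \frac{1}{2}(a_{i,j} + b_{i,j}) + \sqrt{s_{i,j}} & \text{if } s_{i,j} \ge 0, \\ \min(a_{i,j}, b_{i,j}) + \ell_{i,j} h & \text{if } s_{i,j} < 0. \end{cases}$$

We have the boundary (exit) condition $V_{i,j} = 0$ at the target grid point, and we also set $V_{i,j} = \infty$
at the boundary of $\Omega$. The initial iterate can be set as $V^0_{i,j} = |(x_i, y_j) - x^T|$.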
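
A minimal sketch of such a fixed-point iteration is given below, assuming NumPy. It is not the report's Matlab code. Several choices are assumptions made for the sketch: all grid points are updated simultaneously (Jacobi style), the iteration starts from a large finite value instead of the distance function, cells outside the grid are treated as infinite, and the common test |a - b| < l h is used to choose between the two-sided and one-sided updates, which may differ from the switch used in the report. The function name `solve_value` and the softened inverse-square SAM loss in the example are hypothetical.

```python
import numpy as np

def solve_value(loss, h, target, max_iter=5000, tol=1e-12):
    """Fixed-point iteration for the upwind scheme of |grad V| = loss, V(target) = 0."""
    big = 1.0e6                                   # finite stand-in for V = infinity
    V = np.full(loss.shape, big)
    V[target] = 0.0
    lh = loss * h
    for _ in range(max_iter):
        Vp = np.pad(V, 1, constant_values=big)    # outside the grid counts as infinite
        a = np.minimum(Vp[:-2, 1:-1], Vp[2:, 1:-1])   # smaller neighbour, first index
        b = np.minimum(Vp[1:-1, :-2], Vp[1:-1, 2:])   # smaller neighbour, second index
        disc = np.maximum(2.0 * lh**2 - (a - b)**2, 0.0)
        two_sided = 0.5 * (a + b + np.sqrt(disc))     # both directions are upwind
        one_sided = np.minimum(a, b) + lh             # only one direction is upwind
        V_new = np.where(np.abs(a - b) < lh, two_sided, one_sided)
        V_new = np.minimum(V_new, V)                  # keep the iterates non-increasing
        V_new[target] = 0.0                           # exit condition at the target
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    return V

# Example: 41 x 41 grid on [-1, 1]^2, target at the origin, one SAM at (0.5, 0).
n, h = 41, 0.05
xs = -1.0 + h * np.arange(n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
sigma = 0.05                                          # hypothetical SAM strength
loss = 1.0 + sigma / ((X - 0.5)**2 + Y**2 + h**2)     # hypothetical softened loss
V = solve_value(loss, h, target=(20, 20))
print(V[20, 20], V.max() < 1.0e5)
```

With a uniform loss of 1 the computed V approximates the distance to the target, which gives a simple sanity check on the solver.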

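We solve the closed-loop equation

$$\frac{d}{dt} X^A(t) = -\frac{V_{x^A}(X^A(t))}{|V_{x^A}(X^A(t))|}$$

by the finite difference method, given stepsize $\Delta t > 0$. We approximate $V_x$ at the grid
point $(x_i, y_j)$ by the central difference approximation

$$V^x_{i,j} = \Big(\frac{V_{i+1,j} - V_{i-1,j}}{2h},\ \frac{V_{i,j+1} - V_{i,j-1}}{2h}\Big)$$

and use the bilinear interpolation at $(x, y)$ in the $i,j$ sub-square, i.e.,

$$V_{x^A}(X^A) \approx \tilde V^x(x, y) = (1 - d_1)\big((1 - d_2) V^x_{i,j} + d_2 V^x_{i,j+1}\big) + d_1\big((1 - d_2) V^x_{i+1,j} + d_2 V^x_{i+1,j+1}\big),$$

where $X^A = (x, y)$, $d_1 = (x - x_i)/h$ and $d_2 = (y - y_j)/h$ (the offsets are scaled by $h$ so that the weights lie in $[0, 1]$). Thus,

$$\frac{X^A_{k+1} - X^A_k}{\Delta t} = -\frac{\tilde V^x(X^A_k)}{|\tilde V^x(X^A_k)|}.$$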

In  summary 

•  The optimal routing problem is formulated as an optimal control problem.

•  The optimal route is determined by the Dynamic Programming (DP) principle.

•  We develop an efficient and robust algorithm based on DP.

•  The algorithm is implemented in Matlab and runs in under 30 sec for a 100 by 100 grid.
In Figure 3.1 we show

(1) The contours of the value function (the potential curves for cost to the target (0,0)
from the base), colored from red (high) to blue (low).

(2) Routes from various starting points (black lines).

(3) Each route is normal to the potential curves.

(4) 7 SAMs (spots in the figure) covering the target at (0,0) with equal strength.

The algorithm assigns a risk factor to each selected route and determines the sequence
of SAMs which may be engaged on the way to the target. The algorithm is also
used to determine the accessibility of the selected targets from a specified air base by
formulating the exit problem to the specified base.

We have also tested the algorithm by varying the strength $\sigma_i$ of the SAMs. The uncertainty
of a SAM location can be incorporated in our formulation by replacing the point location
with a normal distribution with zero mean and selected variance. It is observed that
the algorithm generates routes which reflect changes in the SAM strength and the
uncertainty of their locations.

The geographical constraints can be incorporated by modifying the loss function $\ell$
accordingly.


3.2  Multi-Body  Dynamic  Formulation 


Our proposed optimal routing algorithm is quite efficient. It is most useful at the
operational and planning level. In order to incorporate the SAM movement and agile,
dynamic changes in their uncertainty and conditions, we propose the following feedback
law based on multi-body interaction dynamics. The algorithm can be implemented in
real time and on-board. Let $x_j^T$ be the target location with value $w_j$ for $1 \le j \le N_T$. A
route $X^A(t)$, $t \ge 0$, is determined as a solution to

$$\frac{d}{dt} X^A(t) = -\frac{W_{x^A}(X^A(t))}{|W_{x^A}(X^A(t))|}, \quad X^A(0) = x^A, \qquad (3.6)$$

where the potential function $W$ is given by

$$W(x^A) = \sum_{j=1}^{N_T} w_j\, |x^A - x_j^T| + \sum_{i=1}^{N_R} \sigma_i\, U(|x^A - x_i^R|),$$

e.g., $U(|x^A - x_i^R|) = \frac{1}{|x^A - x_i^R|}$. Thus, the force field $-W_{x^A}$ is given by

$$-W_{x^A}(x^A) = -\sum_{j=1}^{N_T} w_j\, \frac{x^A - x_j^T}{|x^A - x_j^T|} + \sum_{i=1}^{N_R} \sigma_i\, \frac{x^A - x_i^R}{|x^A - x_i^R|^3}.$$

Here the term $-\frac{x^A - x_j^T}{|x^A - x_j^T|}$ represents an attracting force to the target $j$, and the term $\frac{x^A - x_i^R}{|x^A - x_i^R|^3}$
is a repelling force from the SAM site $i$.

Fig. 3.1
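
The feedback law (3.6) with the example potential U(r) = 1/r can be sketched as below, assuming NumPy. The function names, the explicit Euler stepping, the stopping radius and the numerical values in the example are assumptions made for illustration only.

```python
import numpy as np

def force(x, targets, weights, sams, strengths, eps=1e-12):
    """-grad W at x for W = sum_j w_j |x - xT_j| + sum_i sigma_i / |x - xR_i|."""
    f = np.zeros(2)
    for xt, w in zip(targets, weights):
        d = x - xt
        f -= w * d / max(np.linalg.norm(d), eps)          # attraction to target j
    for xr, s in zip(sams, strengths):
        d = x - xr
        f += s * d / max(np.linalg.norm(d), eps) ** 3     # repulsion from SAM site i
    return f

def potential_route(start, targets, weights, sams, strengths, dt=0.01, max_steps=5000):
    """Explicit unit-speed integration of dX/dt = -W_x / |W_x|."""
    targets = [np.asarray(t, float) for t in targets]
    sams = [np.asarray(s, float) for s in sams]
    x = np.asarray(start, float)
    route = [x.copy()]
    for _ in range(max_steps):
        if min(np.linalg.norm(x - t) for t in targets) <= dt:   # reached a target
            break
        f = force(x, targets, weights, sams, strengths)
        n = np.linalg.norm(f)
        if n < 1e-12:                                            # equilibrium point
            break
        x = x + dt * f / n
        route.append(x.copy())
    return np.array(route)

# One target at the origin and one SAM near the straight-line path (made-up numbers).
route = potential_route(start=(1.0, 0.2), targets=[(0.0, 0.0)], weights=[1.0],
                        sams=[(0.5, 0.0)], strengths=[0.01])
closest = np.min(np.linalg.norm(route - np.array([0.5, 0.0]), axis=1))
print(len(route), closest)
```

Since only local evaluations of the force are needed, each step costs time proportional to the number of targets and SAMs, which is consistent with the remark that the law can run in real time.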

We can relate the closed-loop system (3.6) to the optimal control problem (3.1)-(3.2)
as follows. We define the performance index $\ell(x^A)$ by $\ell = |W_{x^A}|$. Note that if $\sigma_i = 0$ and
$N_T = 1$, then $V = W = |x^A - x^T|$ and $u(t) = -\frac{W_{x^A}}{|W_{x^A}|}$ is optimal. $|W_{x^A}(x^A)|$ attains
local minima and maxima at the same points as $\ell(x^A)$ defined by (3.4) does.

We compared the proposed algorithm with the optimal routes we generated. We
observed that the algorithm generates a route similar to the optimal one with appropriately
chosen SAM strengths $\sigma_i$. In the following figure we show our comparison results.

Similarly,  we  also  construct  a  movement  of  SAMs  as  follows.  We  assume  that  they 
protect  the  targets  while  avoiding  voids. 


d  R  =  w-;f(A--4(f),.Y«(<)) 
*  ‘  \wMxA(t),xK(t))\ 


(3.7) 


where  the  potential  function  W  is  given  by 


nt 


W (xA,  xR)  —  —  ^2  wj 

3  = 1 


R 


Xa 


Nr  Nr  Nr 

+  E  E  Wi  U(\x?  -  xf\)  +  Y/aiU(\xA  -x 

i— 1  j— 1  i— 1 


R  l 


e.g.,  U{\xf  -  xf  |)  =  \xul_xk\  -  Thus  the  force  field  Wxr  is  given  by 


nt 


Wxu(xA,xR)  =  -Y. 


w 


Nr 

+  E- 

3^ 


X, 


3  =  1 
R 


R  _  T 

*Aj  ^  j 

3  TZr~~X~\ 

tA7  tA7  | 


X 


R 


1'3  \tR  _  tR  13 

^  *Aj  rj  | 


-  cr?: 


I  xf-  —  XA\ 


X-  — X 

with  attracting  force  — .  lR  3T .  to  the  target  j  and  repelling  forces  . 

\X  •  X  •  \X  •  X  • 

I  l  J  I  I  l  J  I 

A  combined  closed  loop  dynamics  (3.6)-(3.7)  has  the  game  theoretic  interpretation 
that  is  similar  to  the  one  for  the  optimal  control  problem.  We  consider  the  differential 
game 


n  N 


(3.8) 


mm  max 

U  V 


f  (EE  wi  \xe~xJ\  +  fl  Nxe  -  Vi)} 
J0  e=i  i=i  i= i 


N 


EE  vxji{x\ 

i=l j= 1 


R 


xf)  +  E  wi  I  fa. 

3  =  1 


R 


xf\  )dt 
subject to


=  M  -  1’  J \X fW  =  N  <  Z5- 

The  optimal  feedback  solution  to  (3.8)  is  given  by 

(f\=  (f\  =  a  Vxn(xA(t),XR(t)) 

U{)  |14.(^(t),^(t))|’  VU  P  \VxR(x^{t)}X^m 

where  the  game  value  V  satisfies 

(3.9)  — | Vxa(xa,  xr)\  +  /5  | Vxr(xa,  xr)\  +  £(xa ,  xR)  =  0. 

Now  we  set  V  =  W  by 


n  N  m 

w  =  EE  wi  \xt~xJ\  +  E  ^  u(xt  -  x?)} 

i—  1  j  =  l  Z  =  1 

mm  N 

“EE  °i,jU(xR~: 

i=l j= 1  J=1 


+  H  Wj  IbR 


|(a;f 


and 

l{xA,xR)  =  x^)!  — /5  a:^)!* 

Then  (3.8)  holds. 


4  Expanding Problem Size with Hierarchical Techniques

This  section  is  divided  into  two  parts.  The  first  is  concerned  with  dynamic  allocation 
games  with  complete  observation.  The  main  approach  involves  game  theoretical  studies 
and  hierarchical  decomposition  methods.  The  last  subsection  contains  some  mathematical 
proofs. 


4.1  Hierarchical  Games  with  Complete  Observation 

Two troops (red and blue) are engaged in a battlefield. The blue troop's assets include aircraft
(e.g. bombers), and its goal is to damage or destroy red targets (e.g. oil refineries) and the
surrounding SAMs (surface-to-air missiles). The red troop, on the other hand, operates SAMs
to protect its resources (targets) and aims at damaging or destroying blue aircraft.

In this section, we consider the case with completely observable states, i.e., all red
states are available to the blue troop's decision making.

Assume that the battlefield is geographically divided into two regions $R_1$ and $R_2$; see
Fig. 1.

To illustrate the idea of hierarchical control, let us further divide each of these regions
into two smaller sub-regions $R_{ij}$, $i,j = 1, 2$, so that $R_1 = R_{11} \cup R_{12}$ and $R_2 = R_{21} \cup R_{22}$.

For simplicity, we assume that there are four targets, one in each of these four
sub-regions. A target has two states {functional, destroyed} denoted by {1, 0}, respectively.
Here "functional" means it is either operational or partially operational. We use $T_{ij}$ to
denote the state of the target in sub-region $R_{ij}$, $i,j = 1, 2$.

Decisions  that  concern  both  blue  and  red  troops  involve  allocations  of  assets  between 
different  regions  or  sub-regions  over  time. 

Owing to the inherent complexity of command and control systems, it is difficult to obtain
exact optimal solutions. To reduce the overall complexity of the underlying system, it is
necessary to resort to a hierarchical control approach via aggregation and disaggregation
methods. Analysis of various aggregation methods in connection with manufacturing
systems and continuous-time dynamic systems, and of their near optimality, can be found
in [38], [39] and references therein. In this report, we consider the following hierarchical
structure:

High level. Re-allocation between $R_1$ and $R_2$. These allocations are usually costly
and are infrequent events. These decisions are made at the higher level of the command
and control hierarchy.

Low level. Re-allocation between $R_{i1}$ and $R_{i2}$, $i = 1, 2$. These are less costly and
occur occasionally. Such decisions are made at the lower level of the hierarchy.

Such a decomposition approach is natural in command and control systems because the
interaction among regions is weak compared with the interaction within the regions.

Let $N_a$ denote the total number of blue aircraft units and $N_s$ the total number of red
SAM units, respectively, that are initially available.

In this report, let us first focus on the upper level decision making. Lower level
problems will be treated subsequently. For $i = 1, 2$, we consider the aggregated target
variable $T_i$ defined as $T_i = T_{i1} + T_{i2}$. Recall that $T_{ij} \in \{0, 1\}$. Thus $T_i \in \{0, 1, 2\}$.

Re-allocation Decisions: Both blue and red troops have the option of moving their assets from one region to the other at a certain cost. For simplicity, the re-allocation is assumed here to be instantaneous. Delays in these re-allocations can be handled in a similar fashion.

For $i = 1, 2$, let $X_i^a$ denote the total number of blue aircraft in region $R_i$ and let $X_i^s$ denote the total number of red SAM units in $R_i$.

The blue troop decides whether there is a need to move a number of aircraft units from one region to the other. We use $U_i^a$ to denote the new aircraft allocation in region $R_i$.

Similarly, the red troop decides over time on the allocation of its SAM units; the new allocation in $R_i$ is denoted by $U_i^s$, for $i = 1, 2$.

A cost, at a fixed rate per unit moved, is incurred each time a re-allocation is made. For example, if $l_a$ aircraft units are moved from $R_1$ to $R_2$, then the corresponding cost to the blue troop is

$$K_a \cdot l_a, \quad \text{for given } K_a > 0.$$

In this case, $U_1^a = X_1^a - l_a \ge 0$ and $U_2^a = X_2^a + l_a$.

Similarly, if $l_s$ SAM units are moved from $R_1$ to $R_2$, then a cost of

$$K_s \cdot l_s, \quad \text{for some } K_s > 0,$$

is incurred, and $U_1^s = X_1^s - l_s \ge 0$ and $U_2^s = X_2^s + l_s$.

We use the notation $X^a = (X_1^a, X_2^a)$, $X^s = (X_1^s, X_2^s)$, $T = (T_1, T_2)$, $U^a = (U_1^a, U_2^a)$, and $U^s = (U_1^s, U_2^s)$. We also use $I_X = (X^a, X^s, T)$ and $I_U = (U^a, U^s, T)$ to denote the states before and after a re-allocation, respectively.

Transition Probabilities. Given the current state vector $I_X$, if no re-allocation decision is made, then the states of aircraft, SAMs, and targets can jump to any state according to a conditional probability given $I_X$. If there is a need for a re-allocation, then the re-allocation is made immediately and $I_X$ is changed to $I_U = (U^a, U^s, T)$. The distribution of the new state vector is then determined by the conditional probability given $I_U$.

In each region $R_i$, the jump rates of $X_i^a$ depend on $X_i^s$. Similarly, the jump rates of $X_i^s$ depend on $X_i^a$, and the jump rates of $T_i$ depend on both $X_i^a$ and $X_i^s$.

Objective Function. The game is considered to be over if either all the aircraft are destroyed, i.e., $\{X^a = 0\}$, or all the targets are destroyed, i.e., $\{T = 0\}$.

The objective of the blue troop is to make re-allocation decisions over time so as to minimize $P(X^a = 0)$ and maximize $P(T = 0)$. The red troop, on the other hand, wants to maximize $P(X^a = 0)$ and minimize $P(T = 0)$ by allocating its SAM units.

Let $\mathcal{M}$ denote the state space of $I_X = (X^a, X^s, T)$. Also, let $\mathcal{M}^*$ and $\partial\mathcal{M}$ denote the classes of transient states and absorbing states, respectively. Then $\mathcal{M} = \mathcal{M}^* \cup \partial\mathcal{M}$.

For each $I_X, I_U \in \mathcal{M}$, define the re-allocation cost function

$$G(I_X, I_U) = K_a |X_1^a - U_1^a| - K_s |X_1^s - U_1^s|.$$

Recall that $X_1^a + X_2^a = U_1^a + U_2^a$ and $X_1^s + X_2^s = U_1^s + U_2^s$. It follows that $G(I_X, I_U) = K_a |X_2^a - U_2^a| - K_s |X_2^s - U_2^s|$.
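As a minimal sketch, the re-allocation cost can be written as a small Python function. The function name `reallocation_cost` and the tuple layout of its arguments are illustrative assumptions, not part of the report; the default cost rates 0.1 and 0.2 are the values used in the report's numerical example.

```python
def reallocation_cost(x_a, x_s, u_a, u_s, k_a=0.1, k_s=0.2):
    """Upper-level re-allocation cost G(I_X, I_U).

    x_a, x_s: (region 1, region 2) aircraft and SAM counts before the move.
    u_a, u_s: the same counts after the move.
    k_a, k_s: per-unit cost rates (defaults taken from the report's example).
    """
    # Re-allocation only moves units between the two regions, so totals match.
    if sum(x_a) != sum(u_a) or sum(x_s) != sum(u_s):
        raise ValueError("re-allocation must conserve the number of units")
    # Blue's cost enters with a plus sign and Red's with a minus sign,
    # because Blue minimizes the objective while Red maximizes it.
    return k_a * abs(x_a[0] - u_a[0]) - k_s * abs(x_s[0] - u_s[0])


# Moving 4 aircraft from R1 to R2 while the SAM units stay in place.
cost = reallocation_cost((4, 4), (3, 5), (0, 8), (3, 5))
```

Because the totals are conserved, evaluating the absolute differences in region 2 instead of region 1 would give the same value.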

Let $I_X(n)$ denote the state vector at time $n$. Define the stopping time

$$\tau = \min\{n : I_X(n) \in \partial\mathcal{M}\}.$$

We also define the terminal cost

$$\Phi(I_X) = \chi_{\{X^a = 0\}} - \chi_{\{T = 0\}},$$

where $\chi_A$ is the indicator function of a set $A$.

Note that

$$E\,\Phi(I_X(\tau)) = P(X^a(\tau) = 0) - P(T(\tau) = 0).$$

The objective function can be given as follows:

$$J(I_X, U^a(\cdot), U^s(\cdot)) = E\left[ \sum_{n=0}^{\tau-1} G(I_X(n), I_U(n)) + \Phi(I_X(\tau)) \right].$$


The red troop wants to maximize $J$ and the blue troop wants to minimize $\max_{U^s} J$.

Let $\mathcal{G} = \{f : \mathcal{M} \to \mathbb{R}^1 \text{ and } f \text{ satisfies the boundary condition } f = \Phi \text{ on } \partial\mathcal{M}\}$. Given $f \in \mathcal{G}$, let

$$H(f)(I_X) = \begin{cases} \displaystyle \min_{U^a} \max_{U^s} \Big\{ \sum_{J \in \mathcal{M}} P(I_U, J) f(J) + G(I_X, I_U) \Big\} & \text{for } I_X \in \mathcal{M}^*, \\ \Phi(I_X) & \text{for } I_X \in \partial\mathcal{M}, \end{cases}$$

where $P(I, J)$ is the transition probability from state $I$ to state $J$.

The associated Isaacs equation is given by

$$v = H(v).$$


We arrange the order of $\mathcal{M}$ so that the corresponding transition matrix has the form

$$P = \begin{pmatrix} P_* & \tilde{P} \\ 0 & I \end{pmatrix},$$

where $P_*$ corresponds to the transient states in $\mathcal{M}^*$ and $I$ is an identity matrix that corresponds to the absorbing states in $\partial\mathcal{M}$.

Assumption (A). We assume that all eigenvalues of $P_*$ are inside the unit circle.

Assumption A implies that

$$\|P_* b\| \le \alpha \|b\|$$

for any vector $b$ and some $0 < \alpha < 1$, where the norm is $\|(b_1, \ldots, b_k)\| = \sqrt{b_1^2 + \cdots + b_k^2}$. Clearly, $\alpha$ determines how fast the state vector reaches $\partial\mathcal{M}$, which in turn determines the convergence rate when solving the corresponding Isaacs equation.

Under Assumption A, there are no recurrent states other than the absorbing states in $\partial\mathcal{M}$. Namely, the game has to come to an end with either the aircraft or the targets destroyed.

Theorem. Under Assumption A, the following assertions hold.

(1) (Uniqueness) The Isaacs equation $v = H(v)$ has a unique solution.

(2) (Convergence) Given $v_0 \in \mathcal{G}$, define $v_1 = H(v_0)$. In general, given $v_n \in \mathcal{G}$, define $v_{n+1} = H(v_n)$. Then $v_n$ converges to $v$, the solution to the Isaacs equation $v = H(v)$. Moreover, the convergence rate is of order $\alpha$, i.e.,

$$\|v_n - v\| \le \frac{\alpha^n}{1 - \alpha} \|v_1 - v_0\|.$$

(3) (Verification Theorem) Let $v$ be a solution to $v = H(v)$. Then

$$v(I_X) \le J^{\max}(I_X, U^a(\cdot)) := \max_{U^s(\cdot)} J(I_X, U^a(\cdot), U^s(\cdot)).$$

Moreover, let $U_*^a$ denote a minimizer of

$$\max_{U^s} \Big\{ \sum_{J \in \mathcal{M}} P(I_U, J) v(J) + G(I_X, I_U) \Big\}.$$

Then $U_*^a$ is optimal, i.e.,

$$v(I_X) = J^{\max}(I_X, U_*^a(\cdot)).$$
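The iteration $v_{n+1} = H(v_n)$ in part (2) can be illustrated with a small sketch. The code below is not the report's implementation: the function names, the nested-list data layout, and the three-state toy game are all hypothetical, and the transition law is indexed directly by the pair of actions rather than by a post-decision state $I_U$.

```python
def isaacs_operator(v, P, G, phi, absorbing):
    """Apply H to v: min over Blue actions of the max over Red actions."""
    n = len(v)
    out = []
    for i in range(n):
        if absorbing[i]:
            out.append(phi[i])  # boundary condition v = Phi on absorbing states
            continue
        upper = float("inf")
        for a in range(len(P)):
            # Red's best response to Blue action a in state i.
            worst = max(
                sum(P[a][s][i][j] * v[j] for j in range(n)) + G[a][s][i]
                for s in range(len(P[a]))
            )
            upper = min(upper, worst)
        out.append(upper)
    return out


def solve_isaacs(P, G, phi, absorbing, tol=1e-12, max_iter=10_000):
    """Iterate v <- H(v) until successive iterates differ by less than tol."""
    v = [phi[i] if absorbing[i] else 0.0 for i in range(len(phi))]
    for _ in range(max_iter):
        new_v = isaacs_operator(v, P, G, phi, absorbing)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Toy game (hypothetical numbers): state 0 is transient, state 1 means
# "all aircraft destroyed" (Phi = +1), state 2 means "all targets destroyed"
# (Phi = -1). loss[a][s] is the probability of jumping from state 0 to state 1.
phi = [0.0, 1.0, -1.0]
absorbing = [False, True, True]
loss = [[0.3, 0.2], [0.1, 0.4]]
P = [[[[0.5, loss[a][s], 0.5 - loss[a][s]],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]] for s in range(2)] for a in range(2)]
G = [[[0.0, 0.0, 0.0] for s in range(2)] for a in range(2)]
v = solve_isaacs(P, G, phi, absorbing)
```

In this toy game the transient state keeps probability 0.5 of staying put, so the operator contracts with $\alpha = 0.5$ and the iteration converges to $v(0) = 0.2$, the fixed point of $v = 0.5 v + 0.1$.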


Initial  Allocation. 

In  practice,  it  is  important  for  both  sides  to  allocate  their  resources  appropriately  at 
the  beginning  of  the  game.  This  is  especially  the  case  when  the  re-allocation  cost  is  high. 
One  approach  within  the  framework  of  dynamic  games  can  be  given  as  follows. 

Let $v(I_X) = v(X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2)$. For fixed $(T_1, T_2)$, the blue troop chooses $(X_1^a, X_2^a)$ to minimize $\max_{(X_1^s, X_2^s)} v(I_X)$ subject to $X_1^a + X_2^a = N_a$.

Given these $(X_1^a, X_2^a)$, the red troop chooses $(X_1^s, X_2^s)$ to maximize $v(I_X)$ subject to $X_1^s + X_2^s = N_s$.

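Example. Take $N_a = 8$, $N_s = 8$, and $K_a = 0.1$, $K_s = 0.2$. Let $I^a$ denote either $X_1^a$ or $X_2^a$ and let $I^s$ denote either $X_1^s$ or $X_2^s$. We take

$$P(I^a \to (I^a - 1) \mid I^s) = \frac{0.3 I^s}{0.3 I^s + 1}, \qquad P(I^s \to (I^s - 1) \mid I^a) = \frac{0.3 I^a}{0.3 I^a + 1}.$$

Let $p_0 = \dfrac{0.05 I^a + 1}{0.2 I^s + 1}$, $p_1 = \dfrac{p_0}{p_0 + 1}$, and $p_2 = \dfrac{1}{p_0 + 1}$. Consider

$$P(T \to T' \mid I^a, I^s) = \begin{pmatrix} 1 & 0 & 0 \\ p_1 & p_2 & 0 \\ 0 & p_1 & p_2 \end{pmatrix}.$$

Given $T_1 = T_2 = 2$, the optimal initial allocation is

$$(X_1^a, X_2^a, X_1^s, X_2^s) = (4, 4, 4, 4).$$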
Given the state vector, the optimal policies can also be found in a lookup table. For instance:

If $(X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2) = (4, 4, 3, 5, 2, 2)$, then $(U_1^a, U_2^a, U_1^s, U_2^s) = (4, 4, 3, 5)$; no changes are needed.

If $(X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2) = (4, 4, 3, 5, 0, 2)$, then $(U_1^a, U_2^a, U_1^s, U_2^s) = (0, 8, 3, 5)$, i.e., one only needs to move 4 aircraft units from $R_1$ to $R_2$.

If $(X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2) = (1, 1, 5, 3, 0, 2)$, then $(U_1^a, U_2^a, U_1^s, U_2^s) = (0, 2, 4, 4)$, i.e., move 1 aircraft unit and 1 SAM unit from $R_1$ to $R_2$.

Lower Level Re-allocation: Let us briefly discuss lower-level re-allocation. We only consider the re-allocations between regions $R_{11}$ and $R_{12}$ because the situation is similar in regions $R_{21}$ and $R_{22}$.

In this case, the state variables are the numbers of targets $T_{11}, T_{12}$, with $T_{1i} \in \{0, 1\}$; the blue aircraft units $X_{11}^a, X_{12}^a$, where $X_{1i}^a$ denotes the total number of aircraft units in region $R_{1i}$, $i = 1, 2$; and the red SAM units $X_{11}^s, X_{12}^s$, where $X_{1i}^s$ denotes the total number of SAM units in region $R_{1i}$, $i = 1, 2$.

The re-allocation decisions change $(X_{11}^a, X_{12}^a) \to (U_{11}^a, U_{12}^a)$ such that $X_{11}^a + X_{12}^a = U_{11}^a + U_{12}^a$ for the blue troop, and change $(X_{11}^s, X_{12}^s) \to (U_{11}^s, U_{12}^s)$ such that $X_{11}^s + X_{12}^s = U_{11}^s + U_{12}^s$ for the red troop.

A setup cost is incurred if a re-allocation decision is made. For example, if $l_a$ aircraft units are moved from $R_{11}$ to $R_{12}$, then a cost of $K_1^a \cdot l_a$ is incurred and $U_{11}^a = X_{11}^a - l_a$ and $U_{12}^a = X_{12}^a + l_a$. Similarly, if $l_s$ SAM units are moved from $R_{11}$ to $R_{12}$, then a cost of $K_1^s \cdot l_s$ is incurred and $U_{11}^s = X_{11}^s - l_s$ and $U_{12}^s = X_{12}^s + l_s$. Typically, $K_1^a < K_a$ and $K_1^s < K_s$. This is because the cost of moving within a region is less than the cost of moving between regions.

Transition probabilities and objective functions can be determined similarly to those for the upper-level model.


Sensitivity  Tests  (Upper  Level  Only) 

We use the Monte Carlo method to test the sensitivity of the control policies with respect to changes in various parameters. Consider the model with a set of "true" parameters for the transition probabilities and setup costs. We generate sample paths using these true parameters.

Then we add a small perturbation $\varepsilon$ to each of these parameters. The perturbed model produces a set of control policies $U^{a,\varepsilon}$, $U^{s,\varepsilon}$ (this is the time-consuming part). Such control policies lead to the corresponding theoretical upper value $V^\varepsilon$ and Monte Carlo averages $J^\varepsilon$.


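The following parameters are used in the tests.

Setup Cost: $K_a = 0.1 + 0.1\varepsilon$ and $K_s = 0.2 + 0.1\varepsilon$.

Transition Probabilities: Let $I^a$ denote either $X_1^a$ or $X_2^a$ and let $I^s$ denote either $X_1^s$ or $X_2^s$. We take

$$P(I^a \to (I^a - 1) \mid I^s) = \frac{(0.3 + 0.1\varepsilon) I^s}{(0.3 + 0.1\varepsilon) I^s + 1}, \qquad P(I^s \to (I^s - 1) \mid I^a) = \frac{(0.3 + 0.1\varepsilon) I^a}{(0.3 + 0.1\varepsilon) I^a + 1}.$$

Let

$$p_0 = \frac{(0.05 + 0.1\varepsilon) I^a + 1}{(0.2 + 0.1\varepsilon) I^s + 1}, \qquad p_1 = \frac{p_0}{p_0 + 1}, \qquad p_2 = \frac{1}{p_0 + 1},$$

and

$$P(T \to T' \mid I^a, I^s) = \begin{pmatrix} 1 & 0 & 0 \\ p_1 & p_2 & 0 \\ 0 & p_1 & p_2 \end{pmatrix}.$$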
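The Monte Carlo averaging step can be sketched as follows. This is only an illustration under strong simplifying assumptions that are not in the report: a single region, no re-allocation policy at all, independent jumps of aircraft, SAM units and targets within one time step, and hypothetical function names. It does not reproduce the report's experiment, in which the perturbed model is used to compute the policies.

```python
import random


def perturbed_rates(eps):
    """Rate constants of the sensitivity test for perturbation size eps."""
    return {"attrition": 0.3 + 0.1 * eps,
            "blue_strike": 0.05 + 0.1 * eps,
            "red_defense": 0.2 + 0.1 * eps}


def simulate_region(i_a, i_s, t, rates, rng):
    """Run one sample path in a single region and return the terminal cost."""
    while i_a > 0 and t > 0:
        # All jump probabilities use the state at the start of the step.
        p_air = rates["attrition"] * i_s / (rates["attrition"] * i_s + 1.0)
        p_sam = rates["attrition"] * i_a / (rates["attrition"] * i_a + 1.0)
        p0 = (rates["blue_strike"] * i_a + 1.0) / (rates["red_defense"] * i_s + 1.0)
        p_target = p0 / (p0 + 1.0)
        if rng.random() < p_air:
            i_a -= 1
        if i_s > 0 and rng.random() < p_sam:
            i_s -= 1
        if rng.random() < p_target:
            t -= 1
    # Terminal cost: indicator{aircraft gone} - indicator{targets gone}.
    return (1.0 if i_a == 0 else 0.0) - (1.0 if t == 0 else 0.0)


def monte_carlo_cost(eps, n_paths=2000, seed=0):
    """Average the terminal cost over n_paths independent sample paths."""
    rng = random.Random(seed)
    rates = perturbed_rates(eps)
    total = sum(simulate_region(4, 4, 2, rates, rng) for _ in range(n_paths))
    return total / n_paths


baseline = monte_carlo_cost(0.0)
perturbed = monte_carlo_cost(0.1)
```

Comparing `baseline` and `perturbed` for decreasing `eps` mimics, in a much reduced setting, the comparison of $J^\varepsilon$ with $J^0$; fixing the seed keeps the two estimates reproducible.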


The numerical results are summarized in the following tables. Two cases are considered.

Case 1: $I_{X_0} = (X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2) = (4, 4, 2, 6, 2, 2)$.

In this case, $V^0(I_{X_0}) = -0.6844$ and $J^0(I_{X_0}) = -0.4910$.

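Table 1. Perturbations in Transition Probabilities

| $\varepsilon$ | $V^\varepsilon$ | $\lvert V^\varepsilon - V^0 \rvert$ | $J^\varepsilon$ | $\lvert J^\varepsilon - J^0 \rvert$ |
|------|---------|--------|---------|--------|
| 0.5  | -0.6732 | 0.0112 | -0.4780 | 0.0130 |
| 0.1  | -0.6826 | 0.0018 | -0.4930 | 0.0020 |
| 0.01 | -0.6842 | 0.0002 | -0.4910 | 0.0000 |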

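Table 2. Perturbations in Setup Costs

| $\varepsilon$ | $V^\varepsilon$ | $\lvert V^\varepsilon - V^0 \rvert$ | $J^\varepsilon$ | $\lvert J^\varepsilon - J^0 \rvert$ |
|------|---------|--------|---------|--------|
| 0.5  | -0.5626 | 0.1218 | -0.3675 | 0.1235 |
| 0.1  | -0.6599 | 0.0245 | -0.4701 | 0.0209 |
| 0.01 | -0.6819 | 0.0025 | -0.4885 | 0.0025 |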

Case 2: $I_{X_0} = (X_1^a, X_2^a, X_1^s, X_2^s, T_1, T_2) = (4, 4, 4, 4, 2, 2)$.
Now, $V^0(I_{X_0}) = -0.7196$ and $J^0(I_{X_0}) = -0.6490$.


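Table 3. Perturbations in Transition Probabilities

| $\varepsilon$ | $V^\varepsilon$ | $\lvert V^\varepsilon - V^0 \rvert$ | $J^\varepsilon$ | $\lvert J^\varepsilon - J^0 \rvert$ |
|------|---------|--------|---------|--------|
| 0.5  | -0.7120 | 0.0076 | -0.5900 | 0.0590 |
| 0.1  | -0.7186 | 0.0010 | -0.6330 | 0.0160 |
| 0.01 | -0.7195 | 0.0009 | -0.6490 | 0.0000 |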

Table 4. Perturbations in Setup Costs

| $\varepsilon$ | $V^\varepsilon$ | $\lvert V^\varepsilon - V^0 \rvert$ | $J^\varepsilon$ | $\lvert J^\varepsilon - J^0 \rvert$ |
|------|---------|--------|---------|--------|
| 0.5  | -0.6121 | 0.1075 | -0.5245 | 0.1245 |
| 0.1  | -0.6979 | 0.0217 | -0.6191 | 0.0299 |
| 0.01 | -0.7174 | 0.0022 | -0.6480 | 0.0011 |

Observations: As can be seen from Tables 1-4,

(1) $|V^\varepsilon - V^0| \to 0$ as $\varepsilon \to 0$; this suggests that the value function $V^\varepsilon$ is continuous in $\varepsilon$.

(2) $J^0$ is close to $V^0$; this means that the averaged value obtained by the Monte Carlo method is close to the corresponding theoretical value.

(3) $|J^\varepsilon - J^0| \to 0$ as $\varepsilon \to 0$; this suggests that one can do nearly as well with the perturbed policies as with the optimal policies.

These tests suggest that the control policies obtained are robust in the sense that the corresponding outcomes are not much affected when close-to-real parameters are used.

How  to  determine  the  value  of  parameters? 

(1)  Transition  probabilities  can  be  determined  using  historical  data  and  standard 
statistical  tests. 

(2) Setup costs can be derived from the actual dollar amount of each re-allocation plus a consideration of the time delay needed for the re-allocation.

"Value."

(1) Without the hierarchical approach, one can only deal with 2-4 aircraft units and 4-6 SAM units; with it, we can handle much larger numbers of aircraft and SAM units.

(2) Dramatic reduction of computational effort and time.

(3) The hierarchical approach also provides initial allocations, which make the system free from re-allocation in the near future.

(4) Historical data and human expertise can be used to model a real-life scenario, which makes it possible to use a computer to help humans make decisions that are too hard to obtain otherwise.

Models with Other Objective Functions: We may also consider other objective functions. For example, one may consider the model in which the blue troop wants to minimize the expected discounted exit time $E \sum_{k=0}^{\tau-1} \rho^k$, where $\tau = \min\{n : T(n) = 0\}$ and $0 < \rho < 1$ is a discount factor, and the red troop wants to maximize $E \sum_{k=0}^{\tau-1} \rho^k$. In this case, the corresponding Isaacs equation is

$$v(I_X) = \min_{U^a} \max_{U^s} \Big\{ 1 + \rho \sum_{J \in \mathcal{M}} P(I_U, J) v(J) + G(I_X, I_U) \Big\},$$

with boundary condition $v|_{T=0} = 0$.

The other type of objective function typically used in the literature is

$$J = E \sum_{k=0}^{\infty} \rho^k \big[ F(I_X(k)) + G(I_X(k), I_U(k)) \big],$$

where $F(I_X) = -a(X_1^a + X_2^a) + b(X_1^s + X_2^s) + c_1 T_1 + c_2 T_2$ for positive constants $a, b, c_1, c_2$. The associated Isaacs equation is given by

$$v(I_X) = \min_{U^a} \max_{U^s} \Big\{ F(I_X) + \rho \sum_{J \in \mathcal{M}} P(I_U, J) v(J) + G(I_X, I_U) \Big\}.$$

All results stated in the above theorem can be extended to these two cases, except that there is no need for Assumption A because of the discount factor involved.

4.2  Proofs  Related  to  the  Hierarchical  Technique 

In this appendix, we provide proofs of the results.

For any $f \in \mathcal{G}$ and $I_X = (X^a, X^s, T) \in \mathcal{M}^*$, $I_U = (U^a, U^s, T) \in \mathcal{M}^*$, define the mapping

$$H(f)(I_X) = \min_{U^a} \max_{U^s} \Big\{ \sum_{J \in \mathcal{M}} P(I_U, J) f(J) + G(I_X, I_U) \Big\},$$

and $H(f)$ satisfies the boundary conditions.

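Proof of Theorem. (Part 1). Let $f, f' \in \mathcal{G}$. Given $I_X \in \mathcal{M}^*$, there exists $\bar{U}^a$ such that

$$H(f')(I_X) = \max_{U^s} \Big\{ \sum_{J \in \mathcal{M}} P(\bar{I}_U, J) f'(J) + G(I_X, \bar{I}_U) \Big\},$$

where $\bar{I}_U = (\bar{U}^a, U^s, T)$. Moreover, there exists $\bar{U}^s$ such that

$$\max_{U^s} \Big\{ \sum_{J \in \mathcal{M}} P(\bar{I}_U, J) f(J) + G(I_X, \bar{I}_U) \Big\} = \sum_{J \in \mathcal{M}} P(I'_U, J) f(J) + G(I_X, I'_U),$$

where $I'_U = (\bar{U}^a, \bar{U}^s, T)$. Then we have

$$\begin{aligned}
H(f)(I_X) - H(f')(I_X) &\le \sum_{J \in \mathcal{M}} P(I'_U, J) f(J) + G(I_X, I'_U) - \Big[ \sum_{J \in \mathcal{M}} P(I'_U, J) f'(J) + G(I_X, I'_U) \Big] \\
&\le \sum_{J \in \mathcal{M}^*} P(I'_U, J) \big( f(J) - f'(J) \big) \\
&\le \alpha \|f - f'\|.
\end{aligned}$$

The sum reduces to $J \in \mathcal{M}^*$ because $f = f' = \Phi$ on $\partial\mathcal{M}$. Similarly, interchanging the roles of $f$ and $f'$, we obtain the opposite inequality. In view of the assumption on $P_*$, it follows that

$$\|H(f) - H(f')\| \le \alpha \|f - f'\|.$$

Hence $H$ is a contraction on $\mathcal{G}$, and the solution of $v = H(v)$ is unique. □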

(Part 2.) It is easy to show that

$$\|v_{n+1} - v_n\| = \|H(v_n) - H(v_{n-1})\| \le \alpha \|v_n - v_{n-1}\|.$$

It follows that

$$\|v_{n+1} - v_n\| \le \alpha^n \|v_1 - v_0\|.$$

This implies that $v = \lim_{n \to \infty} v_n$ exists and is a solution to the Isaacs equation $v = H(v)$ with boundary condition $v = \Phi$ on $\partial\mathcal{M}$.
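The error bound stated in part (2) of the theorem follows from the last inequality by a geometric-series argument. A short derivation, using only the triangle inequality and $0 < \alpha < 1$, is:

$$\|v_{n+m} - v_n\| \le \sum_{k=n}^{n+m-1} \|v_{k+1} - v_k\| \le \sum_{k=n}^{n+m-1} \alpha^k \|v_1 - v_0\| \le \frac{\alpha^n}{1 - \alpha} \|v_1 - v_0\|.$$

Hence $\{v_n\}$ is a Cauchy sequence, and letting $m \to \infty$ gives $\|v - v_n\| \le \frac{\alpha^n}{1 - \alpha} \|v_1 - v_0\|$.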

(Part 3.) This part is similar to the proof in [4, p. 14]. We only sketch the proof for the sake of completeness.

Using $v = H(v)$, we have

$$v(I_X(n)) \le \max_{U^s(n)} \Big\{ E\big[ v(I_X(n+1)) \mid U^a(n), U^s(n), I_X(n) \big] + G(I_X(n), I_U(n)) \Big\}.$$

The equality holds if $U^a(n) = U_*^a(n)$. This leads to

$$v(I_X(0)) \le \max_{U^s(\cdot)} J(I_X(0), U^a(\cdot), U^s(\cdot)).$$

The equality holds if $U^a(n) = U_*^a(n)$ for each $n$. This completes the proof.

5  The C Estimation and Control under Imperfect
Information

Most  of  the  work  done  under  JFACC  was  control  (including  games)  under  full  state 
information  (i.e.  full  state  feedback).  However,  partial,  imperfect  and  even  purposefully 
corrupted  information  is  a  critical  part  of  warfare.  In  this  section,  we  address  estimation 
and  control  under  partial/imperfect  information.  In  the  latter  case,  we  take  particular 
care  to  address  the  presence  of  an  intelligent  adversary  in  the  system.  We  will  discuss 
some  simple  algorithms  for  estimation  of  system  state  given  likely  data  types  first.  Then 
we  will  discuss  control  under  imperfect  information  in  the  final  two  subsections. 

We  will  use  the  same  problem  as  given  in  the  previous  sections  where  the  Blue  player 
is  sending  aircraft  against  Red  player  SAMs  and  targets. 

The  first  basic  idea  is  that  the  players  handle  uncertainty  by  maintaining  probability 
distributions  on  the  location  and  number  of  their  opposing  player  forces.  As  in  most 
traditional  approaches  to  output  feedback  control,  these  probability  distributions  allow 
each  player  to  estimate  the  likely  states  of  his  or  her  opponent.  With  a  state  estimate,  one 
may  then  apply  the  control  derived  from  the  full  state  feedback  analysis.  This  approach 
is  by  far  the  most  common  treatment  of  control  under  partial  or  incomplete  information. 
It  remains,  however,  to  derive  the  state  estimator  from  the  probability  distributions. 

The second basic idea is that the estimator should take into account not only the likelihood of the opponent's states but also the risk associated with those states. The risk or loss associated with certain states is encoded by the value function, which is computed in the full state feedback game. Our approach integrates estimation and control to balance the objective function and its measurement of risk against the probability distributions modeling the likely states of the opponent.

In  this  section,  we  describe  the  estimation  and  output  feedback  problems  in  the  context 
of  command  and  control  applications.  We  also  provide  some  results  derived  from  a  detailed 
Monte  Carlo  simulation  of  the  processes  involved. 

5.1  The  Information  State  Variables 

The formulation of this stochastic game relies on two separate state variables for each player: the "true" state and the information state. Each player maintains knowledge of his own state, as well as an information state quantifying his uncertainty in his opponent's state. Thus, the state variable $s \in S$ is composed of four components: the true Blue state, the Blue information state (estimation of Red), the true Red state, and the Red information state (estimation of Blue). The true states have been discussed above.

The  Red  information  state  consists  of  track  filter  parameters  needed  to  estimate  the 
location  of  the  Blue  aircraft.  For  each  blue  aircraft  detected,  the  Red  information  state 
maintains  an  estimate  of  the  position,  the  aircraft  velocity,  and  the  covariance  of  these 
quantities. 

The Blue information state is a $3 \times G$ matrix whose entries are the probabilities of a Red entity of a given type at a particular grid point. That is, $bi_{k,g}$ is the probability of a type $k$ entity at grid location $g$, where $k = 1$ denotes SAM, $k = 2$ denotes emitter, and $k = 3$ denotes nothing.


5.2  Information  Probability  Modeling 

To determine the appropriate transition probabilities, we will need probabilities to quantify how engagements are resolved and how information propagates.

The fundamental probabilities in an engagement are the probability that a Blue attack destroys a Red entity and the probability that a Red SAM attack destroys a Blue aircraft. These probabilities have been described above. Here we focus on the probability distributions for the information states.

Detection and classification probabilities are the primary quantities in this modeling effort. Both players have detection problems. The Red player has the additional problem of establishing a track on the Blue aircraft, while the Blue player has the problem of discriminating between SAM radars, which have track and defensive missile guidance capability, and emitters, which are decoys emitting a signal that "sounds like a SAM radar."

We denote by $p_d(w)$ the probability that a Red SAM detects an aircraft present at a distance $w$. Likewise, we denote by $p_{bd}(w)$ the probability that a Blue aircraft detects a defensive entity (SAM radar or emitter) that is turned on (i.e., emitting a signal).

5.3  The Estimation Problem for Blue


For the Blue player, we assume that the electronic system has the signal processing capability to perform a classification based on received signals. We model this capability with a simple and flexible two-class statistical discriminator, which is encapsulated in statistical error probabilities. We define $p_{bcc}(w, l)$ to be the probability that a Blue aircraft correctly classifies a Red entity of type $l$ at a distance $w$. This simple model applies to many, if not most, radar-based discrimination schemes. The underlying data processing of the radar returns could range from the standard linear discriminator to a neural net or nearest-neighbor classifier. Any of these schemes will have an associated probability of correct classification.

We apply the standard Bayesian approach to modeling the information state updates. For the Blue state, we begin with the "observation likelihood." That is, we seek to determine the probability that a SAM, emitter, or nothing is at grid location $g$ given what we have observed at the current time. We set

$$p_{bo}(\cdot, g) = \sum_{k=1}^{n_a} \Big( 1 - \frac{w_k}{W} \Big) p_{bcc}(w_k, bs_{1,k}, rs_{\cdot,g}),$$

in which $w_k$ denotes the distance between the $k$-th aircraft and the grid location $g$, $W = \sum_{k=1}^{n_a} w_k$, and the function $p_{bcc}$ is the probability that Blue correctly classifies the entity at a site. Conceptually, the function should decrease with $w$: a small distance should translate into more accurate classification. The dependence on the Blue state should be simple: as long as the Blue aircraft is alive ($bs_{1,k} \ne -1$), $p_{bcc}$ depends only on the actual Red state and the distance. Then the Blue information update formula becomes

$$bi_{l,g} = \frac{p_{bo}(l, g)\, bi(l, g)}{\sum_{l'} p_{bo}(l', g)\, bi(l', g)},$$

through an application of Bayes' rule, where the values $bi(\cdot, g)$ on the right-hand side are those held before the update.
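The Bayes update for a single grid location can be sketched in a few lines. The function name `bayes_update` and the list-based layout are hypothetical, and the sketch assumes that the normalization runs over the three entity types at a fixed grid location.

```python
def bayes_update(prior, likelihood):
    """Update Blue's belief about one grid location.

    prior[l]: current probability that an entity of type l is at the location.
    likelihood[l]: observation likelihood p_bo(l, g) for the same location.
    """
    weights = [p * q for p, q in zip(prior, likelihood)]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: every hypothesis has zero weight, keep the prior.
        return list(prior)
    return [w / total for w in weights]


# Types: index 0 = SAM, 1 = emitter, 2 = nothing (hypothetical numbers).
posterior = bayes_update([0.2, 0.3, 0.5], [0.9, 0.05, 0.05])
```

An observation that strongly favors the SAM hypothesis moves most of the probability mass to it, while the posterior still sums to one.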


5.4  The  Estimation  Problem  for  Red 

We assume that Red observes aircraft existence and location. We assume that detection of an aircraft by a SAM requires that the particular SAM radar be on. If the SAM is on, then detection of any given aircraft is a random event whose probability depends on the distance from the SAM to the aircraft. In particular, we take the detection probability to be

$$p_d(r) = \frac{1}{1 + (r/d)^2},$$

where $r$ is the distance from the SAM to the aircraft, and $d$ is a scaling parameter. We assume that if there is a detection, then Red also obtains a position observation. In particular, we simplify the problem for the purposes of this study by having a position observation with spherical error covariance rather than taking into account the details of the range, azimuth and elevation components of the observation, as well as the possibility of Doppler measurements. We also place the entire problem in a two-dimensional space (no altitude component).

The Red player then uses a sub-optimal filter to track the Blue aircraft. The Blue state in the Red filter model consists of a situation state, $S_R$, taking values in $\{1, 0, B\}$ for "in air", destroyed and at base, as well as a position vector and a velocity vector. Ideally, the filter would have a position/velocity estimate and covariance corresponding to each possible path of $S_R$ up to the current time. Of course, this explodes exponentially as time moves forward, and so we take the standard approach of only carrying a finite number of these along. In particular, at each time step, the filter is reduced to three probabilities for $S_R$, denoted $P_S(t)$; specifically, $P_S(t)$ is a three-vector with, for instance, $P_S^1(t)$ being the probability that $S_R(t) = 1$ (i.e., the probability that the aircraft is in the air). Corresponding to each element of this three-vector is a mean position/velocity vector and a corresponding covariance (a $4 \times 4$ matrix).

The position/velocity means and covariances are updated by the standard Kalman filter equations. More specifically, we assume for simplicity that the SAM uses a straightforward state space model for an aircraft's dynamics:

$$x(t + \Delta t) = x(t) + \Delta t\, v(t) + w_x(t), \qquad (4)$$

$$v(t + \Delta t) = v(t) + w_v(t), \qquad (5)$$

in which $x$ and $v$ denote the aircraft position and velocity vectors, and $w_x$ and $w_v$ denote plant noise in the position and velocity models. For the observation updates, we assume that each SAM radar observes the aircraft which it detects, and that the Red defense pools the information into an observation vector $Y(t)$. For an aircraft at position $x(t)$, the components of $Y(t)$ are

$$Y_i(t) = x(t) + \varepsilon_i(t),$$

where the subscript $i$ indicates the $i$-th SAM radar's observation of position and $\varepsilon_i(t)$ is the observation noise. We include the simplest case here to begin to understand the effect of partial information on game-theoretic controls.
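The detection model and the Kalman recursion can be illustrated with a reduced sketch. To stay short and dependency-free it tracks a single coordinate, so the state is `[position, velocity]` with a $2 \times 2$ covariance instead of the $4 \times 4$ covariance of the two-dimensional problem; the function names and all numerical values are hypothetical.

```python
def detect_probability(r, d):
    """Probability that a SAM detects an aircraft at distance r (scale d)."""
    return 1.0 / (1.0 + (r / d) ** 2)


def kalman_predict(x, P, dt, q_pos, q_vel):
    """Propagate mean x = [pos, vel] and covariance P over one time step."""
    pos, vel = x
    x_pred = [pos + dt * vel, vel]
    # P_pred = F P F^T + Q with F = [[1, dt], [0, 1]], Q = diag(q_pos, q_vel).
    a = P[0][0] + dt * P[1][0]
    b = P[0][1] + dt * P[1][1]
    P_pred = [[a + dt * b + q_pos, b],
              [P[1][0] + dt * P[1][1], P[1][1] + q_vel]]
    return x_pred, P_pred


def kalman_update(x, P, y, r_var):
    """Fold in a position observation y with noise variance r_var."""
    s = P[0][0] + r_var                 # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s   # Kalman gain for H = [1, 0]
    innov = y - x[0]
    x_new = [x[0] + k0 * innov, x[1] + k1 * innov]
    # P_new = (I - K H) P
    P_new = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return x_new, P_new


# Track an aircraft moving at constant speed 2 from noise-free observations.
x, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for step in range(1, 21):
    x, P = kalman_predict(x, P, dt=1.0, q_pos=0.01, q_vel=0.01)
    x, P = kalman_update(x, P, y=2.0 * step, r_var=1.0)
```

In a fuller simulation the update step would only be applied at times when a detection occurs, for example when a uniform random draw falls below `detect_probability(r, d)`; otherwise only the prediction step runs and the covariance grows.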

Finally,  note  that  we  do  not  consider  the  track  association  problem  here.  In  other 
words,  we  assume  that  when  the  SAMs  receive  an  observation,  they  correctly  associate 
that  observation  with  the  corresponding  aircraft  that  was  observed.  The  track  association 
problem  is  not  relevant  to  the  study  we  are  making  here,  and  the  additional  complication 
would  be  detrimental  to  our  investigation  of  the  C2  problem  at  hand. 


5.5  Blue  Control  under  Imperfect  Information 

Having defined our information states in terms of probabilistic models, we now proceed with the tasks of developing estimators and integrating observers into the control system. The goals of this research project involve developing and understanding control strategies for air operations. Our particular interest has been in strategies that counter intelligent opponents in a robust way. Toward that end, we seek here to include these robustness considerations in the estimation problem: that is, we seek estimators that balance accuracy of estimation with the risk associated with conducting the air operation.

Traditional  approaches  to  output  feedback  control  involve  the  separation  principle,  or 
the  certainty  equivalence  principle.  The  basic  idea  is  to  develop  feedback  controls  for  the 
full  state  feedback  problem  and  apply  them  replacing  the  state  with  a  state  estimator.  The 
most  common  estimator  used  is  the  maximum  likelihood  estimator.  It  is  well  known  that, 
for  linear  control  systems  with  quadratic  cost  criteria,  the  separation  principle  control 
coincides  with  the  optimal  control. 

Another  Certainty  Equivalence  Principle  exists  in  robust  control.  We  have  applied 
a  generalization  of  this  estimator,  discussed  below,  that  allows  us  to  tune  the  relative 
importance  between  the  likelihood  of  possible  states  and  the  risk  of  being  in  those  states. 
Let  us  motivate  this  in  a  little  more  detail. 

The problem of Stochastic Games under Partial Observations (without resorting to replacement of the state by the information state, which is hugely higher dimensional, and infinite-dimensional in continuous-state problems) is NOT solved. The Certainty Equivalence Principle (sometimes true, usually not) allows one to separate the filtering and control components to some extent. In deterministic games under partial information, Certainty Equivalence implies that one should use the optimal control corresponding to the state given by

$$x^* \in \operatorname{argmax}_x \, [P(t, x) + V(t, x)],$$

where $P$ is the information state and $V$ is the value function (assuming uniqueness of the argmax, of course). Here, the information state is essentially the worst-case cost-so-far, and the value is the minimax cost-to-come. So, heuristically, this is roughly equivalent to taking the worst-case possibility for total cost from initial time to terminal time. (See, for instance, James et al. and McEneaney [21], [20], [28], [29].) The next three paragraphs discuss the mathematics which lead to the heuristic for the algorithm described in the fourth paragraph below. Readers uninterested in these details should skip directly to the fourth paragraph below.

The  deterministic  information  state  is  very  similar  to  the  log  of  probability  density  in 
stochastic  formulations  for  terminal/exit  cost  problems.  (In  fact,  this  is  exactly  true  for 
certain  linear/quadratic  problems.) 

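A risk-averse stochastic control problem is given by

$$\begin{aligned}
d\xi_t &= f(\xi(t), u(t))\, dt + \sqrt{\varepsilon}\, \sigma(\xi(t))\, dW_t, \\
\xi_0 &= x, \\
J^\varepsilon(x, u) &= \varepsilon \log E\Big\{ e^{\frac{1}{\varepsilon} L(\xi(\cdot), u(\cdot))} \Big\}, \\
V^\varepsilon(x) &= \inf_u J^\varepsilon(x, u).
\end{aligned}$$

This risk-averse stochastic control problem is equivalent to the stochastic game:

$$\begin{aligned}
d\xi_t &= \big[ f(\xi(t), u(t)) + \sigma(\xi(t)) w(t) \big]\, dt + \sqrt{\varepsilon}\, \sigma(\xi(t))\, dW_t, \\
\xi_0 &= x, \\
J^\varepsilon(x, u, w) &= E\Big\{ L(\xi(\cdot), u(\cdot)) - \tfrac{1}{2} \|w\|^2 \Big\}, \\
V^\varepsilon(x) &= \inf_u \sup_w J^\varepsilon(x, u, w).
\end{aligned}$$

Both have the same Dynamic Programming Equation:

$$\begin{aligned}
0 &= V_t + \frac{\varepsilon}{2} \sum_{i,j} (\sigma\sigma^T)_{i,j} V_{x_i x_j} + \inf_u \big\{ [f(x, u)]^T \nabla V + L(x, u) \big\} + \sup_w \big\{ [\sigma(x) w]^T \nabla V - \tfrac{1}{2} |w|^2 \big\} \\
&= V_t + \frac{\varepsilon}{2} \sum_{i,j} (\sigma\sigma^T)_{i,j} V_{x_i x_j} + \inf_u \big\{ [f(x, u)]^T \nabla V + L(x, u) \big\} + \tfrac{1}{2} [\nabla V]^T \sigma\sigma^T \nabla V.
\end{aligned}$$

It is by now well known that risk-averse control converges to a deterministic game as $\varepsilon \downarrow 0$ ([11], [12], [13], [32]). All of this lends credibility to a study of the use of the above Certainty Equivalence approach for our problem (although it will be sub-optimal).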

In the stochastic linear/quadratic problem formulation, the information state at any
time, $t$, is characterized as a Gaussian distribution, say

$$p(t, x) = k(t) \exp\left\{ -\tfrac{1}{2} (x - \bar{x}(t))^T C^{-1}(t) (x - \bar{x}(t)) \right\}.$$

In the deterministic game formulation, the information state at any time, $t$, is characterized
as a quadratic cost, say

$$P(t, x) = -\tfrac{1}{2} (x - \hat{x}(t))^T Q(t) (x - \hat{x}(t)) + r(t).$$

Interestingly, $Q$ and $C^{-1}$ satisfy the same Riccati equation (or, equivalently, $Q^{-1}$ and $C$
satisfy the same Riccati equation); $\hat{x}$ and $\bar{x}$ satisfy identical equations as well. Therefore,
$P(t, x) = \log[p(t, x)]$ + “time-dependent constant”.

The above three paragraphs form the (partially) heuristic argument behind our algorithm.
This algorithm is: apply state feedback control at

$$\operatorname*{argmax}_{x} \left\{ \log[p(t, x)] + \kappa V(t, x) \right\}$$

where $p$ is the probability distribution based on the above observation process and filter
for Blue (or Red), and $V$ is the state feedback stochastic game value. Here, $\kappa \in [0, \infty)$ is
a measure of risk-aversion. Note that $\kappa = 0$ implies that one is employing a maximum
likelihood estimate in the state feedback control (for the game), i.e.

$$\operatorname*{argmax}_{x} \{ \log[p(t, x)] \} = \operatorname*{argmax}_{x} \{ p(t, x) \}.$$

Note also that (at least in the linear-quadratic case, where $\log p(t, x) = P(t, x)$ modulo a constant)
$\kappa = 1$ corresponds to the deterministic game Certainty Equivalence Principle, i.e.

$$\operatorname*{argmax}_{x} \{ P(t, x) + V(t, x) \}.$$

As $\kappa \to \infty$, this converges to an approach which always assumes the worst possible state
for the system when choosing a control, regardless of observations.
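As a minimal sketch of the selection rule just described, the following Python fragment picks the state at which the state feedback control would be evaluated, assuming the state space has been reduced to a finite grid. The function name `risk_averse_state`, the three-state probabilities and the values of `V` are hypothetical, and larger `V` is taken to mean a worse outcome for the controlling player.

```python
import numpy as np

def risk_averse_state(log_p, value, kappa):
    """Index of the grid state maximizing log p(x) + kappa * V(x)."""
    if kappa < 0:
        raise ValueError("kappa must be nonnegative")
    score = np.asarray(log_p, dtype=float) + kappa * np.asarray(value, dtype=float)
    return int(np.argmax(score))

# Hypothetical three-state example: p stands in for a filter output and
# V for a state feedback game value (both are illustrative numbers).
p = np.array([0.6, 0.3, 0.1])
V = np.array([1.0, 2.5, 3.0])
log_p = np.log(p)

ml_state = risk_averse_state(log_p, V, kappa=0.0)      # maximum likelihood state
ce_state = risk_averse_state(log_p, V, kappa=1.0)      # certainty equivalence weighting
worst_state = risk_averse_state(log_p, V, kappa=10.0)  # strongly risk-averse weighting
```

In this toy example the three weights select three different states (indices 0, 1 and 2 respectively), which illustrates how the weight moves the choice from the most likely state toward the worst-case state.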

Assuming  Certainty  Equivalence  allows  us  to  use  our  earlier  experimental  result  (see 
above  sections):  The  optimal  Blue  strategy  is  always  either  rollback  or  fly-over.  This 
reduces  our  search  over  Blue  controls  by  an  order  of  magnitude  for  our  problem. 


5.6  Numerical  Experiments  with  Robust  Blue  Control  under 
Imperfect  Information 

We  have  developed  a  simulation  for  the  partially  observed  problem,  which  uses  as  an 
input  the  full  state  feedback  controls  computed  using  the  software  described  in  previous 
sections.  However,  now  for  the  Blue  controller,  we  combine  the  estimator  and  controller 
via  the  risk-averse  technique  described  in  the  previous  section.  The  simulation  generates 
observations  and  battle  outcomes  according  to  the  appropriate  probability  models  and 
evolves  the  information  states  as  the  engagement  progresses.  The  controllers  observe  the 
state,  and  input  the  controls  accordingly. 

We note here that the new control software for the partially observed stochastic game
allows any number of SAMs up to 6 without needing the hierarchical control. (The
simulator and estimator do not have any hard bounds on the number.) It also allows any
number (up to the number of grid points) of possible SAM/Emitter locations. However, a
practical detail is that one needs to store “tables” for each possible geometry distillation
of 6 SAMs. The maximum number of Blue aircraft and Red targets is two each (without
the hierarchical controller). Recall that the geometry distillation describes which Red
entities lie under which other SAM umbrellas. (Many different geometries may have the
same distillation.)

The  example  below  has  only  a  few  SAMs  and  decoys,  but  this  is  not  necessary.  One  has 
the  standard  exponential  growth  in  computation  with  number  of  aircraft  (or  packages), 
number  of  SAMs  and  number  of  targets.  One  has  slower  growth  in  real-time  computation 
with  number  of  decoys. 

Figure 5.1, which is a snapshot of the simulation in progress, illustrates the process.
Included in this image are aircraft (in black), SAM sites with radar (in pink if on, red
if off), emitter sites (in cyan if on, blue if off), and targets (in magenta). The black
circles indicate kill radii for the SAMs. The bar graphs to the right of the battle cartoon
indicate the likelihoods for Blue for each site: Blue must estimate, based on observations,
the probability that a site is a SAM radar or an emitter decoy. Specifically, the red/pink
bars indicate the probability that the site is a SAM, and the blue/cyan bars indicate the
probability that it is an emitter. (We also allow a probability that there is nothing at that
location.) Also pictured are green circles which give the $2\sigma$ radii of the aircraft position
estimates for the Red information state.

Fig. 5.1

Applying this simulation for many Monte Carlo engagements, we can assess the expected
value for a particular scenario. In Figure 5.2, we have selected a scenario which has
3 SAM sites, 2 emitters, and 2 aircraft attacking two targets. Running the simulation for
2000 Monte Carlo samples, we can assess the impact of the risk-averse estimator weight
parameter, $\kappa$, on the outcome. The plot below shows that there is an optimal value in
between applying the straight maximum likelihood estimator and the $\kappa \to \infty$ approach
(which ignores all observations, assuming the worst-case state). Applying the traditional
separated controller/estimator approach ($\kappa = 0$) produces reduced performance, which
means that the Blue player is more likely to lose aircraft under this approach than under
the risk-averse combined controller/estimator of the previous section, which takes a more
game-theoretic, risk-averse approach. Note that the horizontal axis is on a log scale, so
that the minimum in $\kappa$ is rather broad.

Value vs. log(kappa)

6  A State Estimator for the Upper Hierarchical Levels

In  this  section,  we  consider  a  filter  for  a  higher  level  in  the  problem  hierarchy  in  which  the 
Red  state  is  not  completely  available.  Instead,  the  state  with  additive  noise  is  observable. 

Let $X_n$ denote the number of SAMs (or state of targets) in a given region at time $n$,
which is not directly observable. One only observes $Y_n = X_n + \text{noise}$. The objective is to
estimate $X_n$ using the information $\{Y_0, Y_1, \ldots, Y_n\}$.

Let $X_n \in \{0, 1, \ldots, m\}$ be a Markov chain with transition probability matrix
$P = (p_{ij})_{(m+1) \times (m+1)}$. The observation process is

$$Y_n = X_n + \sigma(n, X_n) W_n, \quad n = 0, 1, 2, \ldots,$$

where $\sigma$ is a function of $n$ and $x$, and $\{W_n\}$ is a sequence of independent Gaussian $N(0, 1)$
random variables.

Algorithm.

Remark. This algorithm is derived from an exact optimal filter in a continuous-time
model. It provides fast and, at the same time, fairly reliable estimates for a finite-state
Markov chain.

Using the conditional probabilities $\pi_j(n)$, we define the maximum likelihood estimate of $X_n$ by

$$X_n^{\max} = i \quad \text{if} \quad \pi_i(n) = \max\{\pi_j(n) : j = 0, 1, \ldots, m\}.$$

We also define the conditional mean

$$\bar{x}_n = \sum_{j=0}^{m} j\,\pi_j(n)$$

and its integer version

$$\bar{X}_n = i \quad \text{if} \quad i - \tfrac{1}{2} \le \bar{x}_n < i + \tfrac{1}{2} \quad \text{for } i = 0, 1, \ldots, m.$$

Remark. Note that it is difficult to obtain a recursive form for $\bar{x}_n$ because the
underlying filtering problem is nonlinear.
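The two estimates defined above can be computed from any vector of conditional probabilities. The sketch below uses the standard discrete-time Bayes (hidden Markov model) recursion to produce those probabilities; this recursion is an assumption made for illustration and is not claimed to be the algorithm referred to in the text, which is derived from a continuous-time filter. The names `filter_step` and `estimates` are hypothetical, and `sigma` is assumed to be an array of strictly positive noise levels, one per state.

```python
import numpy as np

def filter_step(pi_prev, y, P, sigma):
    """One predict/update step of a discrete-time Bayes filter.

    pi_prev[i] = P(X_{n-1} = i | Y_0..Y_{n-1}), P[i, j] = P(X_n = j | X_{n-1} = i),
    sigma[j]   = observation noise level when X_n = j (assumed > 0).
    """
    states = np.arange(len(pi_prev))
    predicted = np.asarray(pi_prev, dtype=float) @ P
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_post = (np.log(predicted)
                    - 0.5 * ((y - states) / sigma) ** 2
                    - np.log(sigma))
    post = np.exp(log_post - log_post.max())  # subtract max to avoid underflow
    return post / post.sum()

def estimates(pi):
    """Return (X_max, conditional mean, integer version of the mean)."""
    pi = np.asarray(pi, dtype=float)
    states = np.arange(len(pi))
    x_max = int(np.argmax(pi))
    x_bar = float(states @ pi)
    x_int = int(np.floor(x_bar + 0.5))  # i - 1/2 <= x_bar < i + 1/2
    return x_max, x_bar, x_int

# Hypothetical three-state illustration.
pi_demo = filter_step(np.full(3, 1.0 / 3.0), y=2.0, P=np.eye(3), sigma=np.ones(3))
x_max_demo, x_bar_demo, x_int_demo = estimates([0.2, 0.5, 0.3])
```

Working in the log domain keeps the update stable when an observation is many noise standard deviations away from most states.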

Numerical  Experiments. 

In our numerical simulations, we take $m = 8$ and

$$P = \begin{pmatrix}
0.3 & 0.3 & 0.2 & 0.2 & 0 & 0 & 0 & 0 & 0 \\
0.2 & 0.3 & 0.2 & 0.3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.3 & 0.5 & 0.2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.2 & 0.6 & 0.2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.2 & 0.5 & 0.3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.2 & 0.6 & 0.2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.3 & 0.5 & 0.2 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0.1 & 0.7 & 0.2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.2 & 0.8
\end{pmatrix}.$$

We consider $\sigma(n, x) = 0.2x + \sigma_0$. We fix $X_0 = 8$ and take the initial distribution (with
given $i_0$) to be of the form $p^0_{i_0} = 1$ and $p^0_i = 0$ for $i \ne i_0$.

First, choose $i_0 = 8$ and vary $\sigma_0$, i.e., we have a precise initial estimate of $\{p^0_i\}$. The
dependence on observation noise is given in Table 5.

Table 5.  Dependence on observation noise.

$\sigma_0$                  0.1     0.5     1       1.5     2
$E(X_n - \bar{X}_n)^2$      0.929   1.018   1.163   1.306   1.446
$E(X_n - X_n^{\max})^2$     0.711   0.840   1.024   1.244   1.447

It  is  clear  that  the  larger  the  observation  noise,  the  greater  the  estimation  error. 
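An experiment of this type can be sketched in a few lines of Python. The sketch below simulates the chain with the matrix $P$ and noise model $\sigma(n, x) = 0.2x + \sigma_0$ given above, runs a standard discrete-time Bayes filter (an assumption, not the report's algorithm), and averages the squared errors of the two estimates over time and over runs. The function name `mean_square_errors`, the run lengths and the averaging convention are hypothetical, so the numbers it produces should not be expected to match the table.

```python
import numpy as np

# Transition matrix from the text (m = 8, states 0..8).
P = np.array([
    [0.3, 0.3, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.6, 0.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.5, 0.3, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.2, 0.6, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.7, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.8],
])

def mean_square_errors(sigma0, i0=8, n_steps=100, n_runs=200, seed=0):
    """Monte Carlo estimate of E(X_n - Xbar_n)^2 and E(X_n - X_n^max)^2."""
    rng = np.random.default_rng(seed)
    states = np.arange(P.shape[0])
    sig = 0.2 * states + sigma0           # sigma(n, x) = 0.2 x + sigma_0, sigma0 > 0
    se_mean = se_max = 0.0
    for _ in range(n_runs):
        x = 8                             # true initial state X_0 = 8
        pi = np.zeros(len(states))
        pi[i0] = 1.0                      # prior concentrated at i0
        for n in range(n_steps):
            if n > 0:
                x = rng.choice(len(states), p=P[x])  # chain transition
                pi = pi @ P                          # prediction step
            y = x + sig[x] * rng.standard_normal()   # noisy observation
            with np.errstate(divide="ignore"):
                log_post = np.log(pi) - 0.5 * ((y - states) / sig) ** 2 - np.log(sig)
            post = np.exp(log_post - log_post.max())
            pi = post / post.sum()                   # update step
            se_max += (x - int(np.argmax(pi))) ** 2
            se_mean += (x - int(np.floor(states @ pi + 0.5))) ** 2
    n_total = n_runs * n_steps
    return se_mean / n_total, se_max / n_total

mse_mean, mse_max = mean_square_errors(sigma0=0.5, n_steps=50, n_runs=20)
```

Calling `mean_square_errors` for several values of `sigma0` gives a rough way to examine the trend with observation noise under these assumptions.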

In Figures 1.1 and 1.2, sample paths and the corresponding conditional probabilities
are plotted (with $\sigma_0 = 0.5$). In Figures 2.1 and 2.2, the same functions are plotted
with $\sigma_0 = 2$. In this case, the observation noise is much greater (as seen in Figure 2.2).
However, the filtering algorithm still performs quite well.

It is interesting to see (Figures 1.2 and 2.2) how the conditional probabilities evolve
following the state jumps.


Then, we fix $\sigma_0 = 0.5$ and vary $i_0$. We examine how the algorithm works with poorly
chosen initial distributions.

Table 6.  Dependence on initial probabilities.

$i_0$                       0       1       2       3       4       5       6       7       8
$E(X_n - \bar{X}_n)^2$      3.003   2.678   2.561   1.950   1.550   1.310   1.137   1.036   1.018
$E(X_n - X_n^{\max})^2$     2.380   2.181   2.056   1.558   1.233   1.057   0.915   0.851   0.840

As can be seen from this table, the error increases steadily as the choice $i_0$ moves
away from the true value $i_0 = 8$. However, as shown in Figures 3.1 and 4.1, these errors
disappear quickly in both cases, $\sigma_0 = 0.5$ and $\sigma_0 = 2$.


Sensitivity  Tests. 

First we consider sensitivity with respect to observation noise

$$\sigma(n, x) = (0.2 + \varepsilon)x + (0.5 + \varepsilon).$$

We fix $i_0 = 8$.

Table 7.  Perturbations in Observation Noise

$\varepsilon$               0.5     0.4     0.3     0.2     0.1     0.05    0.01    0
$E(X_n - \bar{X}_n)^2$      1.221   1.162   1.112   1.065   1.027   1.021   1.013   1.018
$E(X_n - X_n^{\max})^2$     1.496   1.310   1.153   1.022   0.895   0.855   0.838   0.840

[Figure: conditional probability panels, cond prob[8] through cond prob[0]]


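Next we fix $\sigma(n, x) = 0.2x + 0.5$ and $i_0 = 8$. We add a perturbation to the transition
matrix $P$:

$$\begin{pmatrix}
0.3 + \varepsilon & 0.3 - \varepsilon & 0.2 & 0.2 & 0 & 0 & 0 & 0 & 0 \\
0.2 & 0.3 - \varepsilon & 0.2 + \varepsilon & 0.3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.3 & 0.5 - \varepsilon & 0.2 + \varepsilon & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.2 & 0.6 - \varepsilon & 0.2 + \varepsilon & 0 & 0 & 0 & 0 \\
0 & 0 & 0.2 & 0.5 - \varepsilon & 0.3 + \varepsilon & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.2 & 0.6 - \varepsilon & 0.2 + \varepsilon & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.3 & 0.5 - \varepsilon & 0.2 + \varepsilon & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0.1 & 0.7 - \varepsilon & 0.2 + \varepsilon \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.2 + \varepsilon & 0.8 - \varepsilon
\end{pmatrix}$$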

The  numerical  results  are  as  follows: 

Table 8.  Perturbations in Transition Probabilities.

$\varepsilon$               0.2     0.1     0.05    0.01    0
$E(X_n - \bar{X}_n)^2$      1.278   1.002   0.999   1.016   1.018
$E(X_n - X_n^{\max})^2$     1.141   0.830   0.821   0.835   0.840

These results demonstrate that the algorithm is robust in the sense that the output
depends on parameters in a continuous fashion. The implication is that one does not
need exact values of the parameters $\sigma$ and $P$; approximations will do nearly
as well.

In the completely observable case, the optimal allocation for the Blue forces is given by

$$(U_1^*(n), U_2^*(n)) = U^*(X(n)) = U^*(X_1^B(n), X_2^B(n), X_1^S(n), X_2^S(n), T_1(n), T_2(n)).$$

When the states for targets and SAMs are not completely observable, we consider the
model

$$Y_1^T(n) = T_1(n) + \sigma_1^T w_1^T(n),$$

$$Y_2^T(n) = T_2(n) + \sigma_2^T w_2^T(n),$$

$$Y_1^S(n) = X_1^S(n) + \sigma_1^S w_1^S(n),$$

$$Y_2^S(n) = X_2^S(n) + \sigma_2^S w_2^S(n).$$

Then our nonlinear filtering scheme gives the estimates

$$(\hat{T}_1(n), \hat{T}_2(n), \hat{X}_1^S(n), \hat{X}_2^S(n)).$$

One may replace the state variables by these estimates and use

$$(U_1^*(n), U_2^*(n)) = U^*(X_1^B(n), X_2^B(n), \hat{X}_1^S(n), \hat{X}_2^S(n), \hat{T}_1(n), \hat{T}_2(n))$$

to control the actual system. Such an approach typically leads to close-to-optimal policies;
see [35].

7  A Means of Evaluating Various Approaches to the C2 Problem

Command  and  Control  (C2 )  problems  in  the  military  domain  are  now  being  addressed 
via  modern  control  techniques.  More  specifically,  one  may  view  the  battle  as  a  plant  to  be 
controlled  toward  some  goal.  Obviously,  the  plant  is  an  extremely  large  system  involving 
both  man  and  machine  components  interacting  over  some  changing  physical  space.  The 
level  of  detail  and/or  abstraction  that  commanders  at  various  levels  face  is,  of  course, 
highly  variable.  The  choice  of  plant  model  for  any  such  system  is  quite  unclear.  Further, 
a  commander  faced  with  some  plant  state  may  have  an  array  of  controllers  available,  each 
suggesting  an  action  different  from  the  others  in  some  way  (see,  for  instance,  [6],  [18], 
[16],  [22],  [33],  [34],  [5],  [17],  [37]).  The  question  is  how  to  choose  among  such  an  array 
of  controllers.  One  may  consider  the  very  large  space  of  all  plants  consistent  with  the 
battle  under  consideration.  By  plants  one  of  course  means  not  just  the  variables,  but  also 
the  dynamics  by  which  the  system  propagates.  With  this  view,  the  question  becomes: 
in  which  region  of  (true)  plant  space  should  a  given  controller  be  used  in  preference 
to  another?  Although  this  is  obviously  a  very  difficult  problem,  one  may  be  able  to 
make  some  progress  in  a  rigorous  fashion.  The  resulting  algorithms  (selecting  controllers 
dependent  on  location  in  plant  space)  might  best  be  termed  a  meta-controller. 

Even  in  the  most  general  sense,  the  best  control  formulation  may  not  be  at  all  clear.  For 
instance,  one  controller  might  choose  to  assign  value  to  the  elimination  of  enemy  surface- 
to-air  missiles  (SAMs),  while  another  may  eliminate  SAMs  as  a  means  of  reaching  a  goal 
target  (whose  elimination  may  have  value)  at  minimum  risk;  that  is,  there  may  not  be 
any  term  in  the  cost  criterion  corresponding  to  the  value  of  a  SAM.  As  another  example, 
one  controller  may  focus  solely  on  attrition  while  another  may  directly  assign  value  to 
geographical  position  of  assets.  The  general  form  of  the  cost  may  be  different  between 
controllers.  For  instance,  one  may  attempt  to  minimize  a  discounted  cost  criterion  while 
another  may  attempt  to  control  to  some  desired  exit  criterion.  Further  still,  and  even 
more  interesting,  one  controller  may  model  certain  stochastic  elements  of  the  dynamics 
while  another  may  focus  on  a  deterministic  game  formulation. 

This section will only begin to address the problem. Clearly, when military planners
consider the various approaches and the arguments of each approach’s ardent proponents,
decisions will need to be made regarding the best approach. Consequently, we consider
here  an  initial  outline  of  the  problem,  and  some  potential  tools.  It  is  hoped  that  more 
progress  can  be  made  so  that  the  military  planners  can  have  some  aids  in  determining 
the  “best”  approaches. 

We  remark  that  the  approach  to  be  described  here  is  one  where  the  true  plant  is 
likely  quite  different  from  the  controller  model  since  nonlinear  control  problems  can  only 
be  solved  (even  numerically)  for  relatively  low-dimensional  systems.  However,  one  can 
generate  very  complex  computer  simulations  of  a  true  system,  and  this  has  certainly  been 
done for military C2 problems. Thus, the approach described below would be used with
a “true” plant being generated by a very detailed computer simulation, with the dynamic
models being used by the controllers being much simpler. Throughout, we will refer to
this  high-fidelity  simulation  as  the  real  world  state;  this  enables  one  to  use  Monte  Carlo 
techniques  to  study  the  real  world  outcomes  when  appropriate. 

The optimization analysis at a general level is rather simple, and appears in Subsection
7.1. In Subsection 7.2, this is extended to a switching meta-controller. In Subsection
7.3,  extensions  to  game  models  are  discussed  briefly.  We  consider  a  small  example  in 
Subsection  7.4  to  indicate  that  the  concepts  are  not  vacuous. 


7.1  Optimization  Analysis 

We  now  display  the  relatively  simple  analysis  which  could  be  used  to  compare  the  effec¬ 
tiveness  of  two  controllers  which  may  have  completely  different  plant  models  with  different 
criteria  which  are  being  optimized. 

One must have a true plant whose state is represented at time $t$ as $X^w(t)$, taking values
in some set $\mathcal{X}^w$. The controls which the commander could use in this true plant (again,
actually a high-fidelity simulation of the real world) are denoted as $u^w(t)$. There must
exist some dynamics for this true plant. Rather than specify particular dynamics, we will
attempt to keep the discussion rather general, allowing a variety of dynamic models such
as the deterministic models

$$\frac{dX^w}{dt} = f(t, X^w(t), u^w(t)) \tag{7}$$

$$X^w(t_{k+1}) = f(t_k, X^w(t_k), u^w(t_k)) \tag{8}$$

where in the latter case, we will assume the state is defined over continuous time as
$X^w(t) = X^w(t_k)$ for all $t \in (t_{k-1}, t_k]$. We will assume that there exist unique solutions for
all feedback controls $u^w(\cdot)$ to be supplied by the control algorithms. We will also allow
stochastic analogues of these dynamics such as

$$dX^w_t = f(t, X^w(t), u^w(t))\,dt + \sigma(t, X^w(t))\,dB_t \tag{9}$$

and

$$X^w(t_{k+1}) = X^w(t_k) + f(t_k, X^w(t_k), u^w(t_k)) + \sigma(t_k, X^w(t_k))\,W(t_k). \tag{10}$$

Here, $B_\cdot$ would be a vector-valued Brownian motion, and $W(\cdot)$ would be a vector-valued
random sequence. Again, we assume that all conditions for existence and uniqueness of
solutions hold for any controls, $u^w$, generated by the control algorithms. (For instance, in
the diffusion case, we presume the control designers restrict themselves to progressively
measurable controls, but this will not be the main focus of this paper, so we ignore the
details for now.) Again, this paper will only attempt to lay out a problem that has
appeared in the C2 area in a general way, and so we minimize discussion of the associated
mathematical machinery.
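For concreteness, the discrete-time stochastic model (10) can be propagated with a short loop. The sketch below is only an illustration: the function name `simulate_plant` and the scalar example are hypothetical, the random sequence is assumed to be independent standard Gaussian, and the noise coefficient is assumed to act elementwise on the state.

```python
import numpy as np

def simulate_plant(f, sigma, policy, x0, t_grid, rng):
    """Propagate X(t_{k+1}) = X(t_k) + f(t_k, X, u) + sigma(t_k, X) * W(t_k)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for t in t_grid[:-1]:
        u = policy(t, x)                   # feedback control u^w(t_k)
        w = rng.standard_normal(x.shape)   # W(t_k), assumed standard Gaussian
        x = x + f(t, x, u) + sigma(t, x) * w
        path.append(x.copy())
    return np.array(path)

# Hypothetical scalar example with the noise switched off, so that the
# recursion reduces to the deterministic update x_{k+1} = x_k + u_k.
rng = np.random.default_rng(0)
t_grid = np.arange(0.0, 6.0)
path = simulate_plant(f=lambda t, x, u: u,
                      sigma=lambda t, x: 0.0,
                      policy=lambda t, x: np.ones_like(x),
                      x0=[0.0], t_grid=t_grid, rng=rng)
```

Repeated calls with different random generators give the sample paths needed for Monte Carlo evaluation of a controller.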

Now, if one is going to apply a control algorithm, say algorithm $i$, one must have a
model of the true plant with which the controller will be computed. Let the state (at time
$t$) in this model for control algorithm $i$ be denoted by $X^i(t)$, taking values in $\mathcal{X}^i$. Then,
in order to compute the corresponding control, one must specify a mapping from the real
world state to the model state. That is, the control designer must specify a well-defined
mapping

$$M^i : \mathcal{X}^w \to \mathcal{X}^i. \tag{11}$$

Thus, given plant trajectory $X^w([t_0, t])$ (where $t_0$ will denote the initial time), the control
at time $t$ will be designed on the basis of $X^i([t_0, t])$. Then, via some algorithm (algorithm
$i$ in this case), the controller will compute the control to be applied at time $t$, $u^i(t)$, with
$u^i(t) \in U^i$, or possibly $U^i_{X^i(t)}$ in the state-dependent control set case. Note that one may
have $u^i(t) = F^i(X^i(t))$ or possibly $u^i(t) = F^i(X^i([t_0, t]))$ if the control depends on the
entire trajectory rather than just the current state. (For now, we do not specifically
denote an observation process.)

Next, to actually apply the controls back in the real world, the designer must also
specify some mapping from $U^i$ (or $U^i_{X^i(t)}$) into $U^w$ (or $U^w_{X^w(t)}$ in the state-dependent control
set case). That is, the designer must specify

$$N^i : U^i \to U^w \tag{12}$$

or, in the state-dependent control set case,

$$N^i_{X^w(t)} : U^i_{X^i(t)} \to U^w_{X^w(t)} \tag{13}$$

where the subscript on $N^i$ is necessary since the specification of the mapping must be such
that the range is restricted to $U^w_{X^w(t)}$. Thus (in the case of a state-dependent control set),

$$u_i^w(t) = N^i_{X^w(t)} \circ F^i \circ M^i(X^w(t)) \tag{14}$$

if the controller only depends on the current state, and

$$u_i^w(t) = N^i_{X^w(t)} \circ F^i \circ \left\{ M^i[X^w(r)]([t_0, t]) \right\} \tag{15}$$

otherwise (where $r$ is a dummy variable).

We will assume throughout that the commander may supply an exit set, $\mathcal{E}^w$, and
certainly will supply an exit time

$$\tau^w = \min\left\{ T, \inf\{t : X^w(t) \in \mathcal{E}^w\} \right\},$$

where $T < \infty$ (and obviously $\tau^w = T$ if no exit set is specified). In the simplest situation,
the commander or planner may have a very specific cost criterion in mind, say
$L^w(X^w([t_0, \tau^w]), u^w([t_0, \tau^w]))$, which they may wish to minimize. For example, let us suppose
that the dynamics of the real world are stochastic, and further, let us suppose that
one wishes to minimize a criterion such as

$$C^w(t_0, x_0^w, A^i) = E\left\{ L^w(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) \right\}$$

with $X^w(t_0) = x_0^w$, where $A^i$ represents the triple $A^i = (M^i, F^i, N^i)$. Then the problem
is simple. For instance, suppose one only needs to compare two control algorithms, $A^1$
and $A^2$. The problem is reduced to merely comparing $C^w(t_0, x_0^w, A^1)$ and $C^w(t_0, x_0^w, A^2)$
at the initial time to see which yields the minimum expected payoff. (Note that there
is actually a time-dependent generalization of this where one may use such a criterion
as a switching criterion in a meta-controller. Some initial development in this direction
appears in the next section.) One might remark that the minimizing control algorithm
might vary depending on the initial state.
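The comparison of two control algorithms under a commander-supplied criterion can be sketched as follows. Each algorithm is represented by the triple of maps (state map, model controller, control map) composed as in (14), and the expected cost is estimated by Monte Carlo simulation of a stand-in for the true plant. Everything concrete here is hypothetical: the scalar plant `step`, the additive `stage_cost`, the two triples, and the function names `world_policy` and `expected_cost`.

```python
import numpy as np

def world_policy(M, F, N):
    """Compose u^w = N(F(M(x^w))) for a controller using only the current state."""
    return lambda x_w: N(F(M(x_w)))

def expected_cost(policy, step, stage_cost, x0, n_steps, n_runs=500, seed=0):
    """Monte Carlo estimate of the expected accumulated cost from x0."""
    rng = np.random.default_rng(seed)  # same seed gives common random numbers
    total = 0.0
    for _ in range(n_runs):
        x = x0
        for _ in range(n_steps):
            u = policy(x)
            total += stage_cost(x, u)
            x = step(x, u, rng)
    return total / n_runs

# Hypothetical scalar "true plant" and commander's cost.
step = lambda x, u, rng: x + u + 0.1 * rng.standard_normal()
stage_cost = lambda x, u: x * x + 0.1 * u * u

# Algorithm 1: identity state map, proportional model controller, identity control map.
A1 = world_policy(M=lambda x: x, F=lambda z: -0.5 * z, N=lambda v: v)
# Algorithm 2: coarse (rounded) model state, full-gain controller, saturated control map.
A2 = world_policy(M=lambda x: float(np.round(x)),
                  F=lambda z: -1.0 * z,
                  N=lambda v: float(np.clip(v, -1.0, 1.0)))

costs = {"A1": expected_cost(A1, step, stage_cost, x0=2.0, n_steps=10),
         "A2": expected_cost(A2, step, stage_cost, x0=2.0, n_steps=10)}
best = min(costs, key=costs.get)
```

Using the same seed for both estimates means both algorithms face the same noise samples, which reduces the variance of the comparison. Repeating the comparison for different values of `x0` shows whether the preferred algorithm depends on the initial state.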

Unfortunately,  the  criterion  which  should  be  minimized  may  not  be  obvious.  The 
problem  under  consideration  may  be  only  a  small  part  of  an  overall  conflict,  and  the 
criterion  may  be  quite  loosely  defined,  as  mentioned  in  the  introduction.  At  the  early 
stages  of  an  examination  of  this  overall  problem  of  applying  control  in  a  C2  environment, 
the  military  will  not  give  a  specific  criterion  to  be  minimized.  Further,  the  plants  to  be 
studied  may  be  quite  variable  -  encompassing  a  wide  variety  of  military  scenarios.  The 
control  designers  themselves  must  propose  criteria  which  the  controllers  will  minimize 
(based, of course, on communication with operational experts). Note that $\tau^w$ may not
correspond to the exit time for the control models. It is required that each controller have
some cost criterion which it is attempting to minimize. Let the cost be $L^i(X^i(\cdot), u^i(\cdot))$.
This cost must be well-defined for all possible paths $X^i(\cdot)$. In particular, if there is an exit
set $\mathcal{E}^i$ and exit/terminal time $\tau^i$, then $L^i$ must be defined for paths such that $\tau^w < \tau^i$.
Alternatively, if $X^i$ enters $\mathcal{E}^i$ at some time $\tau^i < \tau^w$, then the controller must define some
“zero control”, $u_0^i$, to be applied while $X^i(t) \in \mathcal{E}^i$. Lastly, we will assume that the range of
$L^i$, $\mathcal{R}(L^i)$, satisfies $\mathcal{R}(L^i) \subseteq [0, 1]$. (Note that there are a variety of simple mappings to
achieve this range condition if it is not natural.) We note that one would typically expect
$L^i(X^i(\cdot), u^i(\cdot)) = 0$ to indicate an outcome with no Blue losses and total rout of Red
forces, and vice-versa for $L^i(X^i(\cdot), u^i(\cdot)) = 1$, so a fixed range condition is quite natural
in the context of a C2 problem.

Now we return to the commander’s predicament in the absence of an $L^w$. Let
$X_j^w(\cdot) : [0, \tau^w] \to \mathcal{X}^w$ be the path generated by control algorithm $A^j$, i.e. by applying control
$u_j^w$ given by (14), (15) for all $t \in [0, \tau^w]$. Also suppose that $[N^i]^{-1}$ exists (and assume
that $N^i$ is not state-dependent). Then define the cost of considering control algorithm $A^j$
according to the cost metric supplied by controller $i$, denoted (in the deterministic case)
as follows (where we drop the $t_0, x_0^w$ arguments for simplicity):

$$C^i(A^j) = L^i\left[ \mathcal{M}^i(X_j^w(\cdot)), [\mathcal{N}^i]^{-1}(u_j^w(\cdot)) \right] \tag{16}$$

where

$$[\mathcal{M}^i(X_j^w(\cdot))](t) = M^i(X_j^w(t)) \quad \forall t \in [0, \tau^w], \tag{17}$$

$$\left[ [\mathcal{N}^i]^{-1}(u_j^w(\cdot)) \right](t) = [N^i]^{-1}(u_j^w(t)) \quad \forall t \in [0, \tau^w].$$

Note that if the real world dynamics are random, with some underlying probability space
$(\Omega, P)$, then one actually has $X_j^w(\cdot) = [X_j^w(\cdot)](\omega)$ where $\omega \in \Omega$, and one should modify
(16) to

$$C^i(A^j) = E\left\{ L^i\left[ \mathcal{M}^i(X_j^w(\cdot)), [\mathcal{N}^i]^{-1}(u_j^w(\cdot)) \right] \right\}, \tag{18}$$

in which case $C^i(A^j)$ is the expected cost of considering control algorithm $A^j$ according
to the cost metric supplied by controller $i$. We remind the reader that this may depend
on the initial state $X^w(t_0) = x_0^w$.

Now, one would typically expect

$$C^i(A^i) \le C^i(A^j) \tag{19}$$

$$C^j(A^j) \le C^j(A^i) \tag{20}$$

(although this can certainly be violated since the true dynamics do not necessarily correspond
to those assumed by the controllers).

Let us suppose (19), (20) hold. Even in this situation, a commander (with no a priori
preferences) could still construct an objective criterion for choosing between two control
algorithms, $A^1$ and $A^2$. One obvious approach is to consider

$$\frac{C^1(A^1)}{C^1(A^2)} \quad \text{and} \quad \frac{C^2(A^2)}{C^2(A^1)}. \tag{21}$$

To eliminate technicalities, let us assume that all four of the costs in these ratios are
nonzero. Then, if (19), (20) hold,

$$\frac{C^1(A^1)}{C^1(A^2)}, \ \frac{C^2(A^2)}{C^2(A^1)} \in (0, 1].$$

Then one can simply compute

$$d_{1,2} = \frac{C^1(A^1)}{C^1(A^2)} - \frac{C^2(A^2)}{C^2(A^1)}$$

and one would choose $A^1$ if $d_{1,2} < 0$ and vice-versa (with no preference if $d_{1,2} = 0$, of
course).

However, if the number of controllers exceeds two, one can show that there may not
be an optimal choice by this approach. That is, following the above procedure in the case
of three controllers, one may construct a situation where $d_{1,2} < 0$, $d_{2,3} < 0$, and $d_{3,1} < 0$.
Thus, this procedure is not useful.
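A small numerical instance makes the cycle concrete. The cost matrix below is hypothetical: entry `C[i, j]` holds the cost of algorithm j+1 under the metric of controller i+1. Each controller does best under its own metric, so the expected orderings (19), (20) hold for every pair, yet the pairwise differences form a cycle.

```python
import numpy as np

# Hypothetical costs: C[i, j] = cost of algorithm j+1 under controller i+1's metric.
C = np.array([[0.4, 0.8, 0.5],
              [0.5, 0.4, 0.8],
              [0.8, 0.5, 0.4]])

def d(C, i, j):
    """Pairwise comparison: negative means algorithm i is preferred over j."""
    return C[i, i] / C[i, j] - C[j, j] / C[j, i]

# Each is 0.5 - 0.8, so algorithm 1 beats 2, 2 beats 3, and 3 beats 1.
cycle = [d(C, 0, 1), d(C, 1, 2), d(C, 2, 0)]
```

Since all three differences are negative, no algorithm is preferred over both of the others, which is the situation described above.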

On the other hand, in general, one can form a single criterion by which to judge the
controllers from the set of criteria provided by the control designers, and can then evaluate
each control algorithm according to this combination criterion. Two obvious combination
criteria are (in the case of $N$ controllers) as follows:

$$\Pi_{\max}(A^i) = \max_{1 \le j \le N} C^j(A^i) = \max_{1 \le j \le N} E\left\{ L^j\left[ \mathcal{M}^j(X_i^w(\cdot)), [\mathcal{N}^j]^{-1}(u_i^w(\cdot)) \right] \right\},$$

(where expectation is included in case the system is stochastic) and

$$\Pi_{\mathrm{sum}}(A^i) = \sum_{j=1}^{N} C^j(A^i) = \sum_{j=1}^{N} E\left\{ L^j\left[ \mathcal{M}^j(X_i^w(\cdot)), [\mathcal{N}^j]^{-1}(u_i^w(\cdot)) \right] \right\}.$$

The optimal control algorithms for these two combination criteria are obviously given by

$$A_{\max} = \operatorname*{argmin}_{i} \Pi_{\max}(A^i) \quad \text{and} \quad A_{\mathrm{sum}} = \operatorname*{argmin}_{i} \Pi_{\mathrm{sum}}(A^i).$$

Note that if the ranges of the $C^j$ are very different, one may choose to first re-scale the
range by $\tilde{C}^j(t_0, x_0^w, A^i) = C^j(t_0, x_0^w, A^i) / [\max_k \{C^j(t_0, x_0^w, A^k)\}]$ so as to normalize the
payoffs.
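The two combination criteria reduce to column operations on a matrix of cross-evaluated costs. In the sketch below the matrix is hypothetical, with rows indexed by the controller supplying the metric and columns by the algorithm being judged; the function name `combination_choice` is also hypothetical, and the optional normalization divides each row by its largest entry, which is one reading of the re-scaling mentioned above.

```python
import numpy as np

def combination_choice(C, normalize=False):
    """Return (argmin of max criterion, argmin of sum criterion, both criteria).

    C[j, i] is the cost of algorithm i under the metric of controller j.
    """
    C = np.asarray(C, dtype=float)
    if normalize:
        C = C / C.max(axis=1, keepdims=True)  # re-scale each metric to (0, 1]
    pi_max = C.max(axis=0)                    # worst metric for each algorithm
    pi_sum = C.sum(axis=0)                    # total over metrics for each algorithm
    return int(np.argmin(pi_max)), int(np.argmin(pi_sum)), pi_max, pi_sum

# Hypothetical cross-evaluated costs for three controllers.
C = [[0.2, 0.6, 0.5],
     [0.7, 0.3, 0.5],
     [0.9, 0.4, 0.35]]
i_max, i_sum, pi_max, pi_sum = combination_choice(C)
```

For this matrix the max criterion selects the third algorithm (index 2) while the sum criterion selects the second (index 1), so the two combination criteria need not agree.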


7.2  Switching Meta-Controller

One  may  adapt  the  above  optimization  over  control  algorithms  approach  in  order  to 
develop  a  switching  meta-controller.  A  few  more  definitions  will  be  required. 

In the previous section, a fixed control algorithm was chosen at the initial time, $t_0$,
and this remained fixed until $\tau^w$. Now we will allow the choice of control algorithm to
switch as time moves forward. Let $M^i, \mathcal{M}^i, F^i, u^i, N^i, \mathcal{N}^i, u_i^w, X_i^w$ be defined as before.
Since this paper is introductory in nature, let us suppose that the (possible) switching
times are prespecified as $\{t_k\}_{k=0}^{K}$ with $t_{k+1} = t_k + \Delta t$ for $k \ge 0$ and $t_0$ still being the
initial time. Also, let $K \ge T/\Delta t$, where we recall $\tau^w \le T$ regardless of control choice (and
sample path in the stochastic case). In the case of the discrete-time dynamics of (8), (10),
we suppose the times $t_k$ coincide with the time-steps given in the dynamics.

Let us again first consider the case where the commander or planner has a specified
cost criterion in mind. This would take the form $L^w(X^w([t_0, \tau^w]), u^w([t_0, \tau^w]))$. We now
modify this by fixing any $t_k < \tau^w$, and letting the cost accrued up to time $t_k$ (i.e. over
$[t_0, t_k)$) be denoted by

$$L^w_{t_k^-}(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) = L^w_{t_k^-}(X^w([t_0, t_k \wedge \tau^w)), u^w([t_0, t_k \wedge \tau^w))),$$

and the remaining cost to go over $[t_k, \tau^w]$ be denoted by

$$L^w_{t_k}(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) = L^w_{t_k}(X^w([t_k, \tau^w]), u^w([t_k, \tau^w]))$$

where $L^w_{t_k}(X^w([t_k, \tau^w]), u^w([t_k, \tau^w])) = 0$ if $t_k > \tau^w$. Let $\bar{K} = \sup\{k \le K : t_k < \tau^w\}$. Let
$I_k = [t_k, t_{k+1})$ if $k < \bar{K}$ and $I_{\bar{K}} = [t_{\bar{K}}, \tau^w]$. Also (assuming $t_k < \tau^w$), let the cost accrued
over time interval $I_k$ be denoted by

$$L^w_{I_k}(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) = L^w_{I_k}(X^w(I_k), u^w(I_k))$$

where, to be more specific, this will be the cost over $[t_k, t_{k+1})$ if $t_{k+1} < \tau^w$ and over $[t_k, \tau^w]$
otherwise. We assume

$$L^w_{t_k^-}(X^w([t_0, t_k \wedge \tau^w)), u^w([t_0, t_k \wedge \tau^w))) + L^w_{t_k}(X^w([t_k, \tau^w]), u^w([t_k, \tau^w]))
= L^w(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) \tag{A1}$$

for all $k$. Then, by (A1),

$$L^w(X^w([t_0, \tau^w]), u^w([t_0, \tau^w])) = \sum_{k=0}^{\bar{K}} L^w_{I_k}(X^w(I_k), u^w(I_k)).$$

Let

$$i_0 \in \operatorname*{argmin}_{i} C^w(t_0, x_0^w, A^i) = \operatorname*{argmin}_{i} E\left\{ L^w_{t_0}(X_i^w([t_0, \tau^w]), u_i^w([t_0, \tau^w])) \right\}.$$

Suppose the control algorithm so chosen at time $t_0$ is denoted by $A_{t_0} = A^{i_0}$ with resulting control
$u^w_{i_0}$. We define $u^w$ over the time interval $[t_0, t_1 \wedge \tau^w)$ as $u^w([t_0, t_1 \wedge \tau^w)) = u^w_{i_0}([t_0, t_1 \wedge \tau^w))$.
Similarly, the resulting trajectory is $X^w([t_0, t_1 \wedge \tau^w))$. For each $j$ and $t \in [t_0, t_1 \wedge \tau^w)$, one
may then obtain the image of this trajectory under controller $j$’s map, $M^j$, as
$X^j(t) = M^j(X^w(t))$.

Although the goal here is to lay out some general beginnings for the theory, in order
to be more clear let us presume here that we have a stochastic system (as, for instance,
given by (9)). We assume

$$X^w(t_k) = \lim_{s \uparrow t_k} X^w(s) \quad \forall k \ge 0, \ \text{a.s.} \tag{A2}$$


Denote  the  control  algorithm  chosen  for  this  first  time  interval  as  Ai0  =  At0.  Then, 
using  (A2),  Ai0  defines  Xw  over  [t0,ti  A  tw\,  and  let  us  suppose  tw  >  p.  Then,  for  time 
interval  T,  we  may  define,  for  each  controller  i,  ul(t)  =  Fz  o  Ml[X™(t)\  and  so  feedback 


u. 


'(*) 


AC_ 

iVxw(/0)uxy([n,b) 


ofo  M^Xfit)}  for  all  t  G  h,  where  Xf(tx)  =  X (p).  Let 


A  G  MgmmEti  ^{ti){Ll(X™([t0,Tw]),uy([t0,Tw]))} 


where  the  subscripts  on  the  expectation  indicate  that  it  is  conditioned  on  the  state  at 
time  ti.  Then  let 


and 


Xw(t) 

uw(t) 


A 


/qU/i 


if  t  G  Iq, 

if  t  G  I\ 

if  t  G  Iq, 

Ki*) 

if  t  G  I\ . 

Aio(t) 

if  t  G  /o, 

Ah  ( t ) 

if  t  G  I\. 

Lastly,  let 


Proceeding  inductively,  one  obtains  the  switching  meta-controller,  A,  over  the  entire 
period.  We  denote  the  cost  as  Cw(t0,  x™,  A)  or  simply  CW(A). 

The  following  theorem  is  straightforward  to  prove. 

Theorem 7.1 $C^w(A) \le C^w(A^i)$ for all $i \in \{1,2,\dots,N\}$.


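Proof. Let $i_k$ denote the index chosen at time $t_k$. By (A1) and the choice of $i_K$,

$$C^w(A) = E_{t_0,x_0^w}\Big\{ \sum_{k=0}^{K-1} L^w_{I_k}(X^w(I_k), u^w(I_k)) + \min_{1\le i_K\le N} E_{t_K,X^w(t_K)}\big\{ L^w_{I_K}(X^w_{i_K}(I_K), u^w_{i_K}(I_K)) \big\} \Big\}$$

$$= E_{t_0,x_0^w}\Big\{ \sum_{k=0}^{K-2} L^w_{I_k}(X^w(I_k), u^w(I_k)) + E_{t_{K-1},X^w(t_{K-1})}\Big\{ L^w_{I_{K-1}}(X^w_{i_{K-1}}(I_{K-1}), u^w_{i_{K-1}}(I_{K-1})) + \min_{1\le i_K\le N} E_{t_K,X^w(t_K)}\big\{ L^w_{I_K}(X^w_{i_K}(I_K), u^w_{i_K}(I_K)) \big\} \Big\} \Big\}$$

$$\le E_{t_0,x_0^w}\Big\{ \sum_{k=0}^{K-2} L^w_{I_k}(X^w(I_k), u^w(I_k)) + \min_{1\le i_{K-1}\le N} E_{t_{K-1},X^w(t_{K-1})}\big\{ L^w_{t_{K-1}}(X^w_{i_{K-1}}([t_{K-1},\tau^w]), u^w_{i_{K-1}}([t_{K-1},\tau^w])) \big\} \Big\},$$

where the inequality holds because keeping controller $i_{K-1}$ over $I_K$ is one of the options in the inner minimization. Continuing this process, one obtains

$$C^w(A) \le E_{t_0,x_0^w}\big\{ L^w_{t_0}(X^w_i([t_0,\tau^w]), u^w_i([t_0,\tau^w])) \big\} = C^w(A^i)$$

for each $i$.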


In the (more likely) case where the commander or planner does not have an a priori choice for $L^w$, one can proceed similarly, but with the modification induced by using the combination criteria (such as $\mathcal{H}_{max}$) in place of $L^w$ as described at the end of the previous section. We do not include the details of the trivial modifications. Also, one obvious extension of this would be to allow control algorithm switching at any time; this will not be pursued here, but left for later papers/researchers.
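The switching idea can be sketched in a much simpler setting than the one treated above: a finite-state, discrete-time, discounted problem with stationary feedback controllers. That setting, the randomly generated chain, and all names (`evaluate`, `switching_policy`, `random_chain`) are assumptions made for illustration only and do not appear in the text. At every step the sketch follows the controller whose expected cost to go from the current state is smallest, then compares the resulting cost with that of each individual controller.

```python
import random


def evaluate(policy, P, g, gamma, sweeps=200):
    """Expected discounted cost of a fixed feedback policy, per start state."""
    n = len(policy)
    J = [0.0] * n
    for _ in range(sweeps):
        # The right-hand side uses the previous J; the new list replaces it.
        J = [
            g[x][policy[x]]
            + gamma * sum(p * J[y] for y, p in enumerate(P[x][policy[x]]))
            for x in range(n)
        ]
    return J


def switching_policy(policies, costs):
    """In each state, apply the control of the controller with least cost to go."""
    n = len(policies[0])
    meta = []
    for x in range(n):
        best = min(range(len(policies)), key=lambda i: costs[i][x])
        meta.append(policies[best][x])
    return meta


def random_chain(n_states, n_controls, rng):
    """Hypothetical test problem: random transition rows and stage costs."""
    P, g = [], []
    for _ in range(n_states):
        rows = []
        for _ in range(n_controls):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([v / total for v in w])
        P.append(rows)
        g.append([rng.uniform(0.0, 10.0) for _ in range(n_controls)])
    return P, g


rng = random.Random(0)
P, g = random_chain(5, 3, rng)
policies = [[rng.randrange(3) for _ in range(5)] for _ in range(3)]
gamma = 0.5

costs = [evaluate(pi, P, g, gamma) for pi in policies]
meta = switching_policy(policies, costs)
meta_cost = evaluate(meta, P, g, gamma)
```

In this simplified setting, `meta_cost[x]` should not exceed `min(c[x] for c in costs)` for any state `x`, which is the analogue of Theorem 7.1.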


7.3  Game  Models 

In the above analysis, the models were restricted to deterministic and stochastic dynamics. In the C2 arena, game models are also certainly appropriate, and so we make a few comments in this section. The techniques described above may be modified in obvious ways to deal with such models. Although one could consider both deterministic and stochastic games, as well as continuous time/space and discrete time/space models, let us restrict the discussion here to, say, a simple continuous time/space deterministic real-world dynamics model. We suppose there is an antagonistic player in the real world attempting to maximize the criterion the controller or planner is trying to minimize. (Similar models might be used as the control algorithms.) For instance, suppose the dynamics are

$$\frac{dX^w}{dt} = f(t, X^w, u^w, v^w), \qquad X^w(t_0) = x_0^w,$$

where $u^w$ remains our control, but now the dynamics are affected by the controls of the antagonistic player, denoted by $v^w$. Suppose, for instance, that one has restrictions in the real world of $u^w \in U^{M_u} = \{u^w \in C[t_0,\infty) : |u^w(t)| \le M_u \ \forall t \in [t_0,\infty)\}$ and $v^w \in V^{M_v} = \{v^w \in C[t_0,\infty) : |v^w(t)| \le M_v \ \forall t \in [t_0,\infty)\}$. Assume that $f$ is sufficiently well-behaved to guarantee existence and uniqueness for all controls in $U^{M_u}$ and $V^{M_v}$ for all $t \in [t_0,\infty)$ (or at least for all $t \in [t_0,\tau^w]$).

A  standard  cost  criterion  might  take  the  form 


$$C^w(t_0, x_0^w, A^i) = \sup_{v^w \in V^{M_v}} \left\{ \int_{t_0}^{\tau^w} F(s, X^w(s), u^w(s), v^w(s))\,ds + \Psi(X^w(\tau^w)) \right\}.$$


Analyses  similar  to  those  of  the  previous  sections  yield  optimal  meta-controllers  and 
switching  meta-controllers.  We  will  not  pursue  this  further. 


7.4  Example 

The purpose of this section is to indicate that the concept of a meta-controller based on choosing among a possible set of controllers is not entirely vacuous. We will use an example to indicate this. Again, since this paper is only intended to introduce the area as a possible field of study, the example will be rather simple. However, we will allow the models to vary quite significantly from the original system, so that nontrivial behaviors will be manifest. In a real-world system, it would be expected that the system models used by the controllers would be tied down somewhat by the realities of the true system.

Consider a simple discrete-time stochastic real-world model given as follows. Let $\mathcal{X}^w = \{0,1,2,3\}$. Let the dynamics be

$$X^w(t_{k+1}) = \begin{cases} f(X^w(t_k),u^w(t_k),W^w(t_k)) \doteq X^w(t_k) + u^w(t_k) + W^w(t_k) & \text{if } f(X^w(t_k),u^w(t_k),W^w(t_k)) \in \mathcal{X}^w, \\ 0 & \text{if } f(X^w(t_k),u^w(t_k),W^w(t_k)) < 0, \\ 3 & \text{if } f(X^w(t_k),u^w(t_k),W^w(t_k)) > 3, \end{cases}$$

where $u^w \in \{0,-1,-2\}$ and $W^w$ is a time-uncorrelated random process with $P(W^w(t_k) = 1) = 1/2$ and $P(W^w(t_k) = 0) = 1/2$. (We set $t_{k+1} - t_k = 1$.) We will not suppose an exit set for the commander, but a fixed exit time of $T = 20$.
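The real-world dynamics above amount to "add, then clip to the state space". A minimal sketch follows; the names `step`, `transition`, `STATES`, `CONTROLS` and `NOISE` are introduced here for illustration and are not from the text.

```python
from fractions import Fraction

STATES = (0, 1, 2, 3)
CONTROLS = (0, -1, -2)
NOISE = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # P(W = w)


def step(x, u, w):
    """One step of the real-world dynamics: x + u + w, clipped to [0, 3]."""
    return min(3, max(0, x + u + w))


def transition(x, u):
    """Distribution of the next state given state x and control u."""
    dist = {}
    for w, p in NOISE.items():
        y = step(x, u, w)
        dist[y] = dist.get(y, Fraction(0)) + p
    return dist
```

For example, `transition(1, 0)` puts probability 1/2 on each of the states 1 and 2, while `transition(3, 0)` stays at 3 with probability one because of the clipping.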

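Let there be three control algorithms where $\mathcal{X}^1 = \mathcal{X}^2 = \mathcal{X}^3 = \mathcal{X}^w$, and let $M^i$ be the identity for all $i$. Also let $\mathcal{U}^1 = \mathcal{U}^2 = \mathcal{U}^3 = \mathcal{U}^w$ with $N^i$ the identity as well. Suppose however that the dynamics for the three models are

$$X^1(t_{k+1}) = \begin{cases} f^1(X^1(t_k),u^1(t_k),W^1(t_k)) \doteq X^1(t_k) + 2u^1(t_k) + W^1(t_k) & \text{if } f^1(X^1(t_k),u^1(t_k),W^1(t_k)) \in \mathcal{X}^w, \\ 0 & \text{if } f^1(X^1(t_k),u^1(t_k),W^1(t_k)) < 0, \\ 3 & \text{if } f^1(X^1(t_k),u^1(t_k),W^1(t_k)) > 3, \end{cases}$$

$$X^2(t_{k+1}) = \begin{cases} f^2(X^2(t_k),u^2(t_k),W^2(t_k)) \doteq X^2(t_k) + u^2(t_k) + 2W^2(t_k) & \text{if } f^2(X^2(t_k),u^2(t_k),W^2(t_k)) \in \mathcal{X}^w, \\ 0 & \text{if } f^2(X^2(t_k),u^2(t_k),W^2(t_k)) < 0, \\ 3 & \text{if } f^2(X^2(t_k),u^2(t_k),W^2(t_k)) > 3, \end{cases}$$

$$X^3(t_{k+1}) = \begin{cases} f^3(X^3(t_k),u^3(t_k),W^3(t_k)) \doteq X^3(t_k) + u^3(t_k) + W^3(t_k) & \text{if } f^3(X^3(t_k),u^3(t_k),W^3(t_k)) \in \mathcal{X}^w, \\ 0 & \text{if } f^3(X^3(t_k),u^3(t_k),W^3(t_k)) < 0, \\ 3 & \text{if } f^3(X^3(t_k),u^3(t_k),W^3(t_k)) > 3, \end{cases}$$

with the probabilities for $W^1$, $W^2$ and $W^3$ being identical to those for $W^w$.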


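Suppose the commander or planner does not have an a priori cost criterion. Let the criteria proposed by the controllers be

$$C^1 = E\Big\{ \sum_{k=0}^{\infty} \big(\tfrac{1}{2}\big)^k \big[ (X^1(t_k))^2 + 3|u^1(t_k)| \big] \Big\},$$

$$C^2 = E\Big\{ \sum_{k=0}^{\infty} \big(\tfrac{1}{2}\big)^k \big[ (X^2(t_k))^2 + |u^2(t_k)| \big] \Big\},$$

$$C^3 = E\Big\{ \Psi(X^3(\tau^3)) + \sum_{k=0}^{K^3} 8|u^3(t_k)| \Big\},$$

where in the last cost criterion, the exit set is $\{0,3\}$ with $\Psi(0) = 0$ and $\Psi(3) = 16$, $\tau^3$ is the corresponding exit time, and $K^3$ is the last index prior to exit.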

The corresponding dynamic programming equations are as follows. For the first controller, it is

$$V^1(x) = x^2 + \min_{u}\big\{ 3|u| + \tfrac{1}{2} E[V^1(f^1(x,u,w))] \big\} \quad \forall x \in \{0,1,2,3\}.$$

Similarly, for the second controller it is

$$V^2(x) = x^2 + \min_{u}\big\{ |u| + \tfrac{1}{2} E[V^2(f^2(x,u,w))] \big\} \quad \forall x \in \{0,1,2,3\}.$$
For the third controller, one automatically has $V^3(0) = 0$ and $V^3(3) = 16$. For the remaining states, the dynamic programming equation is

$$V^3(x) = \min_{u}\big\{ 8|u| + E[V^3(f^3(x,u,w))] \big\} \quad \forall x \in \{1,2\}.$$

Solving the dynamic programming equations for the first control model, one finds that the value function and optimal feedback control are

$$V^1(0) = 5/3, \quad V^1(1) = 5, \quad V^1(2) = 11, \quad V^1(3) = 18$$

and

$$F^1(0) = 0, \quad F^1(1) = 0, \quad F^1(2) = -1, \quad F^1(3) = 0.$$

Proceeding similarly for the second control model, one obtains the value

$$V^2(0) = 3, \quad V^2(1) = 5, \quad V^2(2) = 9, \quad V^2(3) = 49/3.$$

In this case, there are multiple optimal feedback controls; we will choose

$$F^2(0) = 0, \quad F^2(1) = -1, \quad F^2(2) = -2, \quad F^2(3) = -2.$$
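The values for the second control model can be checked numerically by value iteration. The sketch below assumes the second criterion is the discounted sum of $x^2 + |u|$ with factor 1/2 and uses the model-2 dynamics $X + u + 2W$ clipped to $\{0,\dots,3\}$; the function names are hypothetical and the iteration scheme is an illustrative choice, not necessarily the method used by the authors.

```python
GAMMA = 0.5
STATES = (0, 1, 2, 3)
CONTROLS = (0, -1, -2)


def next_state_2(x, u, w):
    """Model-2 dynamics: x + u + 2w, clipped to {0, ..., 3}."""
    return min(3, max(0, x + u + 2 * w))


def q_value(V, x, u):
    """x^2 + |u| + (1/2) E[V(next state)], with W in {0, 1} equally likely."""
    expected = 0.5 * V[next_state_2(x, u, 0)] + 0.5 * V[next_state_2(x, u, 1)]
    return x * x + abs(u) + GAMMA * expected


def value_iteration(sweeps=100):
    """Repeatedly apply the dynamic programming operator starting from zero."""
    V = [0.0] * len(STATES)
    for _ in range(sweeps):
        V = [min(q_value(V, x, u) for u in CONTROLS) for x in STATES]
    return V


V2 = value_iteration()

# All controls attaining the minimum in each state (ties are kept).
optimal_controls = {
    x: [u for u in CONTROLS if abs(q_value(V2, x, u) - V2[x]) < 1e-9]
    for x in STATES
}
```

Under these assumptions `V2` agrees with the values 3, 5, 9 and 49/3 quoted above, and `optimal_controls` lists more than one minimizer in some states, with the chosen $F^2$ among them.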

Lastly, for the third control model, one finds

$$V^3(0) = 0, \quad V^3(1) = 16, \quad V^3(2) = 16, \quad V^3(3) = 16.$$

Note that the controls are not well-defined for the exit states (where we recall the true exit time may differ from the controllers'), and so we will arbitrarily choose controls for the points in the exit set. We obtain the feedback control

$$F^3(0) = -1, \quad F^3(1) = -1, \quad F^3(2) = 0, \quad F^3(3) = 0.$$

Note that the above value functions are not necessarily identical to $C^i(t_0, x_0^w, A^i)$ since the dynamics used in propagating the true state may differ from those in the models. Also, note that the dynamics and criteria were time-independent in this problem.

Once one has determined the feedback controllers (and given that $M^i$ and $N^i$ are identities), it is easy to determine the associated costs via dynamic programming equations. We present the results in tabular form.


x_0^w   C^1(A^1)  C^1(A^2)  C^1(A^3)  C^2(A^1)  C^2(A^2)  C^2(A^3)  C^3(A^1)  C^3(A^2)  C^3(A^3)
0       1.667     2         6         0.5       1         2         0         0         0
1       5         6         7.333     1.5       3         3.333     76        16        16
2       11        12        11.333    7.167     7         11.333    84        24        16
3       18        19.5      18        18        13.5      18        16        16        16
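The first three columns of the table can be reproduced by evaluating each fixed feedback under the real-world dynamics. The sketch below assumes that the first criterion is the discounted sum of $x^2 + 3|u|$ with factor 1/2, and it treats the horizon as effectively infinite, since the discounting makes the contribution beyond $T = 20$ negligible at the precision of the table. Only criterion 1 is covered, and the names `evaluate` and `FEEDBACK` are hypothetical.

```python
GAMMA = 0.5
STATES = (0, 1, 2, 3)

# Feedback controls F^1, F^2, F^3 from the text, indexed by state.
FEEDBACK = {
    "A1": (0, 0, -1, 0),
    "A2": (0, -1, -2, -2),
    "A3": (-1, -1, 0, 0),
}


def step(x, u, w):
    """Real-world dynamics: x + u + w, clipped to {0, ..., 3}."""
    return min(3, max(0, x + u + w))


def evaluate(feedback, control_weight, sweeps=100):
    """Discounted cost of x^2 + control_weight*|u| under a fixed feedback."""
    C = [0.0] * len(STATES)
    for _ in range(sweeps):
        C = [
            x * x
            + control_weight * abs(feedback[x])
            + GAMMA * (0.5 * C[step(x, feedback[x], 0)]
                       + 0.5 * C[step(x, feedback[x], 1)])
            for x in STATES
        ]
    return C


# Criterion 1 evaluated for each controller, one entry per initial state.
C1 = {name: evaluate(F, control_weight=3) for name, F in FEEDBACK.items()}
```

Rounded to three decimals, `C1["A1"]`, `C1["A2"]` and `C1["A3"]` match the columns $C^1(A^1)$, $C^1(A^2)$ and $C^1(A^3)$.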
Using these results, one is able to compute $d_{1,2}$, $d_{1,3}$ and $d_{2,3}$ as discussed in Section 7.1. Using $d_{1,2}$ yields a preference of $A^1$ over $A^2$ if $x_0^w = 0,1,2$, and vice-versa for the other state. Using $d_{1,3}$ yields a preference of $A^1$ over $A^3$ if $x_0^w = 0$, vice-versa for $x_0^w = 1,2$, and no preference for $x_0^w = 3$. Using $d_{2,3}$ yields a preference of $A^2$ over $A^3$ everywhere. Comparing these three sets of results, one sees that $A^1$ is the preference if $x_0^w = 0$, and that $A^2$ is the preference if $x_0^w = 3$. However, when $x_0^w = 1,2$, one sees that these pairwise preferences do not lead to any single overall preference among the three.

We now proceed to compute the combination criteria $\mathcal{H}_{sum}$ and $\mathcal{H}_{max}$.

x_0^w   H_sum(A^1)  H_sum(A^2)  H_sum(A^3)  H_max(A^1)  H_max(A^2)  H_max(A^3)
0       2.167       3           8           1.667       2           6
1       84          25          26.667      76          16          16
2       102.167     43          38.667      84          24          16
3       52          49          52          18          19.5        18

Using combination criterion $\mathcal{H}_{sum}$ in the controller optimization method of Section 7.1 leads to a choice of $A^1$ if $x_0^w = 0$, $A^2$ if $x_0^w = 1$, $A^3$ if $x_0^w = 2$, and $A^2$ if $x_0^w = 3$.
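The choices above can be recomputed from the cost table. The sketch assumes that $\mathcal{H}_{sum}$ and $\mathcal{H}_{max}$ are, respectively, the sum and the maximum over the three criteria of the cost of running a given controller; their definitions are in the previous section and are not restated here, so this reading is an assumption. The names `COST`, `combine` and `best` are hypothetical.

```python
# COST[criterion][controller] = costs for initial states 0..3, from the table.
COST = {
    "C1": {"A1": (1.667, 5, 11, 18), "A2": (2, 6, 12, 19.5),
           "A3": (6, 7.333, 11.333, 18)},
    "C2": {"A1": (0.5, 1.5, 7.167, 18), "A2": (1, 3, 7, 13.5),
           "A3": (2, 3.333, 11.333, 18)},
    "C3": {"A1": (0, 76, 84, 16), "A2": (0, 16, 24, 16),
           "A3": (0, 16, 16, 16)},
}
CONTROLLERS = ("A1", "A2", "A3")


def combine(op, criteria=("C1", "C2", "C3")):
    """Apply op (sum or max) across criteria, per controller and state."""
    return {
        a: [op(COST[c][a][x] for c in criteria) for x in range(4)]
        for a in CONTROLLERS
    }


def best(table, x):
    """Controllers attaining the smallest combined cost at state x (ties kept)."""
    low = min(table[a][x] for a in CONTROLLERS)
    return [a for a in CONTROLLERS if table[a][x] == low]


h_sum = combine(sum)
h_max = combine(max)
choice_sum = [best(h_sum, x) for x in range(4)]
choice_max = [best(h_max, x) for x in range(4)]
```

Under this reading, `choice_sum` selects $A^1$, $A^2$, $A^3$, $A^2$ for the initial states 0 through 3. For the maximum criterion, the tie at state 3 is between $A^1$ and $A^3$, both of which apply the control 0 there.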

We also note that the above choices could be used in a feedback form to yield a switching meta-controller as described in Section 7.2. Noting that the choice at each switching time depends only on the remaining cost to come, one sees that one needs to compute also $\mathcal{H}_{sum}$ over the criteria of controllers 1 and 2 for $t \ge \tau^3$; to avoid confusion, let us denote this as $\mathcal{H}'_{sum}$. This is given by


x_0^w   H'_sum(A^1)  H'_sum(A^2)  H'_sum(A^3)
0       2.167        3            8
1       6.5          9            10.667
2       18.5         19           22.667
3       36           33           36

The resulting switching meta-control is identical to the optimized control choice above for $t < \tau^3$; that is, one chooses $A^1$ if $x^w = 0$, $A^2$ if $x^w = 1$, $A^3$ if $x^w = 2$, and $A^2$ if $x^w = 3$. In other words, one obtains the real-world feedback given by $u^w(0) = 0$, $u^w(1) = -1$, $u^w(2) = 0$, and $u^w(3) = -2$. For $t \ge \tau^3$, one obtains $A^1$ if $x^w = 0$, $A^1$ if $x^w = 1$, $A^1$ if $x^w = 2$, and $A^2$ if $x^w = 3$. In other words, one obtains the real-world feedback given by $u^w(0) = 0$, $u^w(1) = 0$, $u^w(2) = -1$, and $u^w(3) = -2$ after $\tau^3$. It is not difficult to check that the switching controller satisfies the statement of Theorem 7.1.
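The two-phase feedback described above can be assembled mechanically from the controller choices. In the sketch below, the pre-exit choice is taken from the text, the post-exit choice is recomputed as the minimizer of the $\mathcal{H}'_{sum}$ table, and the names `choose`, `feedback` and `meta_control` are hypothetical.

```python
# Feedback controls F^1, F^2, F^3 from the text, indexed by state.
FEEDBACK = {
    "A1": (0, 0, -1, 0),
    "A2": (0, -1, -2, -2),
    "A3": (-1, -1, 0, 0),
}

# H'_sum values per controller for states 0..3, from the table above.
H_SUM_PRIME = {
    "A1": (2.167, 6.5, 18.5, 36),
    "A2": (3, 9, 19, 33),
    "A3": (8, 10.667, 22.667, 36),
}

# Controller chosen in each state before the exit time of criterion 3.
CHOICE_BEFORE_EXIT = ("A1", "A2", "A3", "A2")


def choose(table):
    """Controller with the smallest tabulated value in each state."""
    return tuple(min(table, key=lambda a: table[a][x]) for x in range(4))


def feedback(choice):
    """Real-world control in each state, taken from the chosen controller."""
    return tuple(FEEDBACK[a][x] for x, a in enumerate(choice))


choice_after_exit = choose(H_SUM_PRIME)
u_before = feedback(CHOICE_BEFORE_EXIT)
u_after = feedback(choice_after_exit)


def meta_control(x, exited):
    """Switching meta-control: the phase depends on whether tau^3 has passed."""
    return (u_after if exited else u_before)[x]
```

For instance, `meta_control(2, exited=False)` gives 0 and `meta_control(2, exited=True)` gives -1, in agreement with the feedback values stated above.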
One can also obtain the optimized and switching controllers for $\mathcal{H}_{max}$. These differ from those for $\mathcal{H}_{sum}$ only at $x^w = 3$. Specifically, the optimized controller choice and the switching controller (for all $t$) have $u^w(3) = 0$.